Part B — Classification of FIFA Players (2017 Dataset)

Author

Dylan Lynch

Published

April 24, 2026

 Introduction

This analysis uses FIFA 2017 player data to determine which measurable attributes best predict whether a footballer is Elite (overall rating ≥ 85). Two supervised classification techniques are applied, a decision tree and a binary logistic regression, using variables that represent technical ability, agility, stamina, and physical strength, alongside age and potential. The goal is to build models that distinguish elite players from the rest, to evaluate how accurately they perform on training and testing data, and to interpret which characteristics most strongly influence elite status.

 Classification Tree Analysis

The classification tree recursively splits the dataset on the predictor variables that best separate elite from non-elite players. The output plot shows the key thresholds at which variables such as ball control, finishing, and composure yield the largest impurity reduction (information gain), meaning they are the most useful for classification. Each branch represents a decision rule, and each terminal node (leaf) contains the predicted class (Elite = Y / Non-Elite = N) along with its probability and purity.
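To make the splitting step concrete, the following is a minimal Python sketch of how a single candidate threshold can be scored. It assumes a CART-style Gini impurity criterion; the report does not state which criterion the tree actually used, and the ratings and labels below are hypothetical toy values, not rows from the FIFA dataset.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gain(values, labels, threshold):
    """Impurity reduction obtained by splitting on `value >= threshold`."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

def best_threshold(values, labels):
    """Score midpoints between sorted distinct values; return (threshold, gain)."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(((t, split_gain(values, labels, t)) for t in candidates),
               key=lambda pair: pair[1])

# Hypothetical toy data (assumption, for illustration only).
ball_control = [62, 70, 74, 78, 83, 86, 88, 91]
elite = ["N", "N", "N", "N", "N", "Y", "Y", "Y"]

threshold, gain = best_threshold(ball_control, elite)
print(f"best split: ball_control >= {threshold}, impurity reduction = {gain:.5f}")
```

A full tree repeats this search over every predictor at every node and keeps the split with the largest reduction, which is why a few strongly separating variables end up near the root.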

 Interpretation

The tree’s structure points to technical skill as a main driver of elite classification among the performance attributes, although the variable-importance scores below show that age and potential carry the most weight overall.
For example, players with high ball control and dribbling ratings who also exceed a threshold for finishing are overwhelmingly placed in the Elite class. Conversely, those with lower finishing and stamina values tend to fall into the non-elite group.
This is consistent with footballing intuition: elite players are typically strong technical attackers with high control and composure, whereas average players may be physically capable but less refined in touch and decision-making.

Variable Importance

         age    potential    composure    finishing       vision ball_control 
   63.338241    32.186629    18.650157    12.377355    11.775010     8.467275 
     stamina    dribbling acceleration      agility      balance 
    4.362890     3.220793     1.853680     1.797895     1.198597 

The variable-importance output quantifies how much each predictor contributed to reducing impurity in the tree.
Age (63.3) and potential (32.2) dominate, suggesting that the model relies heavily on a player’s career stage and growth ceiling to classify them. The next-largest contributors (composure, finishing, vision, and ball control) reflect technical proficiency and mental sharpness.
Lower-importance variables such as balance, agility, and acceleration still affect predictions but add little explanatory power once the primary attributes are known. Among the performance attributes, this pattern suggests that the difference between elite and non-elite players lies more in refined skill execution and consistency than in raw physicality.
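Because the raw importance scores sit on an arbitrary scale, it can help to express them as shares of the total. The short Python sketch below does this with the values copied from the output above; the normalisation is an illustrative choice and is not part of the original analysis.

```python
# Importance scores copied from the variable-importance output above.
importance = {
    "age": 63.338241,
    "potential": 32.186629,
    "composure": 18.650157,
    "finishing": 12.377355,
    "vision": 11.775010,
    "ball_control": 8.467275,
    "stamina": 4.362890,
    "dribbling": 3.220793,
    "acceleration": 1.853680,
    "agility": 1.797895,
    "balance": 1.198597,
}

total = sum(importance.values())
shares = {name: score / total for name, score in importance.items()}

# Print the predictors from most to least important, as a share of the total.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<13}{share:6.1%}")

top_two = shares["age"] + shares["potential"]
print(f"age + potential: {top_two:.1%}")
```

On this scale, age and potential together account for roughly 60% of the total importance, while balance, agility, and acceleration each contribute about 1%.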

 Accuracy Assessment

 Training   Testing 
0.9771755 0.9632107 

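The tree classifies 97.7% of training players and 96.3% of testing players correctly. When training accuracy notably exceeds testing accuracy, a tree is likely overfitting, capturing noise in the training set rather than general skill patterns. Here the gap is only about 1.4 percentage points, so any overfitting appears mild.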

 Binary Logistic Regression


Call:
glm(formula = elite ~ ., family = binomial(link = "logit"), data = fifa_train[, 
    c(predictors, "elite")])

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.974e+02  2.993e+01  -6.596 4.22e-11 ***
age           1.179e+00  2.004e-01   5.882 4.06e-09 ***
potential     1.801e+00  2.816e-01   6.397 1.58e-10 ***
acceleration  5.369e-02  5.203e-02   1.032  0.30216    
agility      -6.050e-03  6.455e-02  -0.094  0.92532    
balance      -1.572e-02  3.787e-02  -0.415  0.67810    
ball_control -1.809e-01  8.365e-02  -2.162  0.03059 *  
composure     6.142e-02  8.155e-02   0.753  0.45136    
dribbling     1.925e-03  7.343e-02   0.026  0.97909    
finishing     2.945e-02  2.869e-02   1.027  0.30463    
vision        4.551e-02  4.401e-02   1.034  0.30106    
stamina       1.511e-01  4.849e-02   3.116  0.00183 ** 
strength     -1.847e-02  4.647e-02  -0.398  0.69095    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 432.933  on 700  degrees of freedom
Residual deviance:  92.784  on 688  degrees of freedom
AIC: 118.78

Number of Fisher Scoring iterations: 9

The logistic regression model quantifies how each predictor influences the log-odds of a player being elite.
Each coefficient gives the direction and strength of association: holding the other variables constant, a positive value raises the probability of elite status and a negative value lowers it. A p-value below 0.05 is used as the threshold for statistical significance.
The summary output shows that age (p < 0.001), potential (p < 0.001), stamina (p = 0.002), and ball control (p = 0.03) are significant predictors. All other variables have p-values above 0.30 and are not statistically significant.
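The z values and p-values in the coefficient table follow from each estimate and its standard error. As a rough check, the Python sketch below recomputes them for three rows, assuming the usual Wald test with a standard normal reference distribution; the function name `wald_test` is introduced here for illustration only.

```python
import math

def wald_test(estimate, std_error):
    """Return (z, two-sided p-value) using a standard normal reference."""
    z = estimate / std_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-sided tail area.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Estimates and standard errors copied from the summary output above.
rows = {
    "stamina": (1.511e-01, 4.849e-02),
    "ball_control": (-1.809e-01, 8.365e-02),
    "dribbling": (1.925e-03, 7.343e-02),
}

results = {name: wald_test(est, se) for name, (est, se) in rows.items()}
for name, (z, p) in results.items():
    print(f"{name:<13} z = {z:6.3f}  p = {p:.5f}")
```

The recomputed values agree with the table to the reported precision, which confirms that significance here depends on the size of a coefficient relative to its standard error rather than on its absolute size.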

Interpreting Coefficients

  • Age (+1.179) and Potential (+1.801) have large positive coefficients: holding the other attributes constant, older (more experienced) players and players with higher potential are more likely to be elite.

  • Ball control (−0.181) has a small but significant negative coefficient. This is probably a consequence of correlation with other skill metrics: excellent players tend to have uniformly high ratings, so the predictors overlap in what they explain.

  • Stamina (+0.151) is positive and significant, consistent with physically durable players sustaining elite performance throughout matches.

  • Other attributes, including acceleration, agility, balance, composure, dribbling, finishing, vision, and strength, show no statistically significant effect once age, potential, stamina, and ball control are in the model.

 Odds Ratios

                OddsRatio         2.5 %       97.5 %
(Intercept)  1.777697e-86 3.167590e-116 1.512007e-64
age          3.250072e+00  2.300915e+00 5.105441e+00
potential    6.056220e+00  3.755758e+00 1.149959e+01
acceleration 1.055157e+00  9.546478e-01 1.172429e+00
agility      9.939683e-01  8.752886e-01 1.130600e+00
balance      9.844026e-01  9.138398e-01 1.060982e+00
ball_control 8.345240e-01  6.997426e-01 9.724210e-01
composure    1.063342e+00  9.131579e-01 1.254795e+00
dribbling    1.001927e+00  8.685974e-01 1.160279e+00
finishing    1.029891e+00  9.745805e-01 1.091948e+00
vision       1.046560e+00  9.654337e-01 1.147153e+00
stamina      1.163133e+00  1.064369e+00 1.290224e+00
strength     9.816962e-01  8.932950e-01 1.072122e+00

The odds ratio (OR) values make interpretation intuitive:

  • Age: 3.25 → For each additional year of age (within the observed range), the odds of being elite multiply by 3.25, holding the other predictors constant. Established experience therefore correlates strongly with elite rating.

  • Potential: 6.06 → For each one-point increase in potential, the odds of being elite roughly sextuple, the largest per-unit effect in the model.

  • Stamina: 1.16 → Each additional stamina point raises the odds of being elite by about 16%.

  • Ball_control: 0.83 → Because OR < 1, each additional ball-control point lowers the odds of being elite by about 17% once the other predictors are held constant, again hinting at interplay among correlated skill variables.

Together, these results suggest that elite players combine high potential and experience with physical endurance. The confidence intervals for the remaining attributes all include 1.
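Each odds ratio is the exponential of the corresponding logistic coefficient. The Python sketch below reproduces the point estimates for the four significant predictors from the rounded coefficients in the summary output, and shows how an effect compounds over a multi-point change. The helper `odds_multiplier` is a name introduced here for illustration, and the confidence intervals are not recomputed because the report does not say how they were obtained.

```python
import math

# Coefficients copied (rounded) from the logistic regression summary.
coefficients = {
    "age": 1.179,
    "potential": 1.801,
    "ball_control": -0.1809,
    "stamina": 0.1511,
}

# Odds ratio = exp(coefficient); percent change in odds per one-unit increase.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}
percent_change = {name: (ratio - 1.0) * 100.0 for name, ratio in odds_ratios.items()}

def odds_multiplier(beta, delta):
    """Multiplicative change in odds for a `delta`-unit change in a predictor."""
    return math.exp(beta * delta)

for name in coefficients:
    print(f"{name:<13} OR = {odds_ratios[name]:.3f}  ({percent_change[name]:+.1f}% per unit)")

stamina_plus_10 = odds_multiplier(coefficients["stamina"], 10)
print(f"10 extra stamina points multiply the odds by {stamina_plus_10:.2f}")
```

Because the effects multiply, a 10-point difference in stamina corresponds to odds about 4.5 times higher, all else equal, even though the per-point odds ratio of 1.16 looks modest.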

 Accuracy

 Training   Testing 
0.9728959 0.9632107 

The logistic model achieved a training accuracy of 0.973 and a testing accuracy of 0.963. Its testing accuracy is identical to the tree’s (0.9632107), and its training accuracy is marginally lower (0.973 versus 0.977).
The small gap between training and testing accuracy, about one percentage point, indicates good generalisability and minimal overfitting. Logistic regression thus provides a robust, reproducible model of elite status in which each attribute has an interpretable, linear effect on the log-odds of the outcome.
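Accuracy is easier to judge against the share of the majority class. The report does not give the class counts, but they can be inferred from the null deviance in the summary output. The Python sketch below does so under two assumptions: that the null deviance comes from an intercept-only model fitted to 701 individual (ungrouped) training rows, and that elite players are the minority class.

```python
import math

def null_deviance(k, n):
    """Deviance of an intercept-only logistic model with k positives in n rows."""
    if k in (0, n):
        return 0.0
    p = k / n
    return -2.0 * (k * math.log(p) + (n - k) * math.log(1.0 - p))

n_train = 700 + 1    # the null deviance is reported on 700 degrees of freedom
reported = 432.933   # null deviance from the summary output

# Assumption: elite is the minority class, so only search k up to n / 2.
k_elite = min(range(0, n_train // 2 + 1),
              key=lambda k: abs(null_deviance(k, n_train) - reported))

# Accuracy of a trivial rule that labels every training player non-elite.
baseline_accuracy = 1.0 - k_elite / n_train

print(f"inferred elite players in training set: {k_elite} of {n_train}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Under these assumptions, about 9% of the training players are elite, so always predicting non-elite would already score about 0.907 on the training data. The models’ accuracies of 0.97 are best read as an improvement over that baseline; the class balance of the testing set is not reported.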

 Interpretation

The significant positive coefficients (p < 0.05), namely age, potential, and stamina, raise the odds of being Elite.
Ball control is the only significant negative coefficient, and it lowers those odds once the other predictors are held constant.

 Comparison of Models

Aspect             Classification Tree         Logistic Regression
Interpretability   Clear decision rules        Statistical inference (coefficients, odds ratios)
Training accuracy  0.977                       0.973
Testing accuracy   0.963                       0.963
Key predictors     Age, potential, composure   Age, potential, stamina, ball control (negative)

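The tree offers visual insight into thresholds and interactions between attributes, while logistic regression shows a slightly smaller gap between training and testing accuracy (about 1.0 versus 1.4 percentage points) and supports formal inference on each predictor.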
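The gap between training and testing accuracy gives a simple, if rough, indication of overfitting. The Python sketch below computes it for both models from the accuracies reported above; it is an illustrative check, not part of the original analysis.

```python
# Accuracies copied from the two accuracy outputs in this report.
accuracy = {
    "tree": {"train": 0.9771755, "test": 0.9632107},
    "logistic": {"train": 0.9728959, "test": 0.9632107},
}

# A larger positive gap means the model fits the training data better than
# it predicts unseen data.
gaps = {model: acc["train"] - acc["test"] for model, acc in accuracy.items()}

for model, gap in gaps.items():
    print(f"{model:<9} train-test gap = {gap * 100:.2f} percentage points")
```

Both gaps are small, and the difference between them is under half a percentage point, so the preference for logistic regression rests more on its interpretability than on a clear accuracy advantage.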

 Conclusion

Both modelling approaches tell a similar story: high potential and experience, supported by technical precision and composure, characterise FIFA elite players. Age and potential dominate in both models, with composure and finishing contributing further in the tree and stamina in the logistic regression. The classification tree provides straightforward decision rules that coaches could use to identify elite candidates, while logistic regression quantifies the importance of each attribute.
The high accuracy rates, above 96% on the testing data for both models, indicate that the available attributes capture elite quality very effectively.
Although both models perform strongly and reach the same testing accuracy, logistic regression shows a slightly smaller gap between training and testing accuracy and offers clearer statistical justification, making it the preferred method for predictive applications.
These results suggest that top footballers distinguish themselves through a blend of refined skill, endurance, and mental control, rather than through isolated physical attributes alone.