Part B — Classification of FIFA Players (2017 Dataset)
Introduction
This analysis uses FIFA 2017 player data to determine which measurable attributes best predict whether a footballer is Elite (overall rating ≥ 85). The modelling applies two supervised classification techniques, a Decision Tree and a Binary Logistic Regression, using age and potential together with variables that represent technical ability, agility, stamina, and physical strength. The goal is to build models that distinguish elite players from others, to evaluate how accurately they perform on training and testing data, and to interpret which characteristics most strongly influence elite status.
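As a minimal sketch of the target definition above, the snippet below labels players as Elite from their overall rating. The threshold of 85 and the Y/N labels come from the text; the function name `label_elite` and the sample ratings are hypothetical and only illustrate the rule.

```python
ELITE_THRESHOLD = 85  # from the text: Elite means overall rating >= 85


def label_elite(overall: int) -> str:
    """Return "Y" for elite players and "N" otherwise."""
    return "Y" if overall >= ELITE_THRESHOLD else "N"


# Hypothetical overall ratings, not taken from the dataset.
sample_overall = [91, 85, 84, 70]
sample_labels = [label_elite(r) for r in sample_overall]
print(sample_labels)
```

Because the rule uses ≥, a player rated exactly 85 counts as Elite.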
Classification Tree Analysis
The classification tree recursively splits the dataset based on predictor variables that best separate elite from non‑elite players. The output plot shows the key thresholds where variables like ball control, finishing, and composure create the greatest information gain, meaning they are the most useful for classification. Each branch represents a decision rule, and the terminal nodes (leaves) contain the predicted class (Elite = Y / Non‑Elite = N) along with its probability and purity.
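The splitting step can be illustrated with a small sketch. It assumes Gini impurity as the split criterion (the source does not state which criterion was used) and uses hypothetical finishing ratings and labels, so the chosen threshold is illustrative rather than a reproduction of the fitted tree.

```python
def gini(labels):
    """Gini impurity of a list of "Y"/"N" labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for y in labels if y == "Y") / n
    return 2 * p * (1 - p)


def impurity_reduction(values, labels, threshold):
    """Drop in impurity when splitting into value < threshold and value >= threshold."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Hypothetical finishing ratings and elite labels for six players.
finishing = [60, 65, 70, 80, 88, 90]
elite = ["N", "N", "N", "N", "Y", "Y"]

# Try every observed value as a threshold and keep the most useful one.
best_threshold = max(
    sorted(set(finishing)),
    key=lambda t: impurity_reduction(finishing, elite, t),
)
print(best_threshold, impurity_reduction(finishing, elite, best_threshold))
```

A tree-growing algorithm repeats this search over every predictor at every node, which is why the first few splits tend to use the variables that separate the classes most cleanly.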
Interpretation
The tree's structure suggests that technical skill is a main driver of elite classification, alongside the age and potential variables that dominate the importance scores.
For example, players with both high ball control and dribbling ratings, and who also exceed a threshold for finishing, are overwhelmingly placed in the Elite class. Conversely, those with lower finishing and stamina values tend to fall into the non‑elite group.
This aligns with performance logic: elite players are typically strong technical attackers with high control and composure, whereas average players may be physically capable but less refined in touch and decision‑making.
Variable Importance
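| Variable | Importance |
|---|---|
| age | 63.338241 |
| potential | 32.186629 |
| composure | 18.650157 |
| finishing | 12.377355 |
| vision | 11.775010 |
| ball_control | 8.467275 |
| stamina | 4.362890 |
| dribbling | 3.220793 |
| acceleration | 1.853680 |
| agility | 1.797895 |
| balance | 1.198597 |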
The variable‑importance output quantifies which predictors contributed most to reducing impurity in the tree.
Here, age (63.3) and potential (32.2) emerged as dominant, suggesting that the model heavily uses a player’s current stage and growth ceiling to classify them. Other major contributors — composure, finishing, vision, and ball control — reflect technical proficiency and mental sharpness.
Lower-importance variables such as balance, agility, and acceleration still affect predictions but add little explanatory power once the primary attributes are known. This pattern indicates that, among the skill attributes, technical and mental qualities (composure, finishing, vision) separate elite from non-elite players more than raw physicality does, although age and potential dominate overall.
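To make the relative sizes easier to compare, the importance scores can be rescaled to shares of the total. The sketch below uses the values reported above; the rescaling is an illustrative convenience, not part of the original output.

```python
# Variable-importance scores as reported in the output above.
importance = {
    "age": 63.338241,
    "potential": 32.186629,
    "composure": 18.650157,
    "finishing": 12.377355,
    "vision": 11.775010,
    "ball_control": 8.467275,
    "stamina": 4.362890,
    "dribbling": 3.220793,
    "acceleration": 1.853680,
    "agility": 1.797895,
    "balance": 1.198597,
}

total = sum(importance.values())
shares = {name: score / total for name, score in importance.items()}
top_two_share = shares["age"] + shares["potential"]

for name, share in shares.items():
    print(f"{name:<13} {share:6.1%}")
print(f"age + potential: {top_two_share:.1%}")
```

On these figures, age and potential together account for roughly 60% of the total importance, which supports reading them as the dominant predictors.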
Accuracy Assessment
Training Testing
0.9771755 0.9632107
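Training accuracy (0.977) exceeds testing accuracy (0.963) by about 1.4 percentage points. A gap this small suggests only mild overfitting; a markedly larger gap would indicate that the tree was capturing noise in the training set rather than general skill patterns.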
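The accuracy figures can be read as simple proportions of correctly classified players. The sketch below defines accuracy on hypothetical labels and then recovers the implied counts for the tree. The training size of 701 follows from the 700 null degrees of freedom in the logistic output; the test size of 299 is an assumption that is not stated in the source, chosen because it reproduces the reported testing accuracy to seven decimals.

```python
def accuracy(actual, predicted):
    """Share of cases where the predicted class equals the actual class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels for illustration only.
toy_actual = ["Y", "N", "N", "N", "Y"]
toy_predicted = ["Y", "N", "N", "Y", "Y"]
toy_accuracy = accuracy(toy_actual, toy_predicted)  # 4 of 5 correct

N_TRAIN = 701  # null deviance degrees of freedom (700) + 1
N_TEST = 299   # assumption: not stated in the source

# Implied numbers of correctly classified players for the tree.
tree_train_correct = round(0.9771755 * N_TRAIN)
tree_test_correct = round(0.9632107 * N_TEST)

print(toy_accuracy)
print(tree_train_correct, N_TRAIN - tree_train_correct)
print(tree_test_correct, N_TEST - tree_test_correct)
```

Under these assumptions the tree misclassifies 16 training players and 11 test players, so each test error moves the accuracy by about a third of a percentage point.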
Binary Logistic Regression
Call:
glm(formula = elite ~ ., family = binomial(link = "logit"), data = fifa_train[,
c(predictors, "elite")])
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.974e+02 2.993e+01 -6.596 4.22e-11 ***
age 1.179e+00 2.004e-01 5.882 4.06e-09 ***
potential 1.801e+00 2.816e-01 6.397 1.58e-10 ***
acceleration 5.369e-02 5.203e-02 1.032 0.30216
agility -6.050e-03 6.455e-02 -0.094 0.92532
balance -1.572e-02 3.787e-02 -0.415 0.67810
ball_control -1.809e-01 8.365e-02 -2.162 0.03059 *
composure 6.142e-02 8.155e-02 0.753 0.45136
dribbling 1.925e-03 7.343e-02 0.026 0.97909
finishing 2.945e-02 2.869e-02 1.027 0.30463
vision 4.551e-02 4.401e-02 1.034 0.30106
stamina 1.511e-01 4.849e-02 3.116 0.00183 **
strength -1.847e-02 4.647e-02 -0.398 0.69095
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 432.933 on 700 degrees of freedom
Residual deviance: 92.784 on 688 degrees of freedom
AIC: 118.78
Number of Fisher Scoring iterations: 9
The logistic regression model quantifies how each predictor influences the log‑odds of a player being elite.
Coefficients represent direction and strength of association: positive values increase elite probability, negative values decrease it, when other variables are held constant. Statistical significance (p < 0.05) determines which variables contribute meaningfully.
The summary output shows that age (p < 0.001), potential (p < 0.001), ball control (p = 0.03), and stamina (p = 0.0018) are significant predictors. All other variables have p‑values > 0.05, indicating weaker or nonsignificant effects.
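The z values and p-values in the summary can be checked by hand. The sketch below recomputes the Wald statistic (estimate divided by standard error) and its two-sided normal p-value for the four significant predictors, and reproduces the AIC as the residual deviance plus twice the number of estimated parameters (13, counting the intercept). Inputs are the rounded figures printed above, so small differences in the last digit are expected.

```python
import math


def wald_test(estimate, std_error):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


# (estimate, standard error) pairs copied from the coefficient table.
coefs = {
    "age": (1.179, 0.2004),
    "potential": (1.801, 0.2816),
    "ball_control": (-0.1809, 0.08365),
    "stamina": (0.1511, 0.04849),
}
results = {name: wald_test(b, se) for name, (b, se) in coefs.items()}

for name, (z, p) in results.items():
    print(f"{name:<13} z = {z:7.3f}  p = {p:.3g}")

# AIC = residual deviance + 2 * number of estimated parameters.
aic = 92.784 + 2 * 13
print(f"AIC = {aic:.2f}")
```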
Interpreting Coefficients
Age (+1.179) and Potential (+1.801) have large positive coefficients, meaning that, with the other predictors held constant, older (more experienced) players and players with higher potential ratings are more likely to be elite.
Ball control (–0.181) has a small but significant negative sign, possibly due to correlation with other skill metrics — excellent players often have uniformly high stats, causing overlap in influence.
Stamina (+0.151) is positive and significant: physically durable players maintain elite performance throughout matches.
Other attributes such as agility, balance, vision, and finishing show no significant statistical effect once the main technical and physical factors are included.
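Because the coefficients act on the log-odds scale, their effect on probability depends on where a player starts. The sketch below converts between probability and log-odds and applies the potential coefficient (+1.801) to a player with a hypothetical starting probability of 10%; that baseline is an assumption chosen for illustration, not a value from the fitted model.

```python
import math


def logit(p):
    """Probability to log-odds."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Log-odds to probability (the logistic function)."""
    return 1 / (1 + math.exp(-x))


baseline_prob = 0.10      # hypothetical starting probability of being elite
potential_coef = 1.801    # coefficient from the summary output

# One extra point of potential adds the coefficient to the log-odds.
new_prob = inv_logit(logit(baseline_prob) + potential_coef)
print(f"{baseline_prob:.2f} -> {new_prob:.3f}")
```

The same one-point increase would move a player who starts at 90% much less in absolute terms, which is why coefficients are usually reported as odds ratios rather than probability changes.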
Odds Ratios
OddsRatio 2.5 % 97.5 %
(Intercept) 1.777697e-86 3.167590e-116 1.512007e-64
age 3.250072e+00 2.300915e+00 5.105441e+00
potential 6.056220e+00 3.755758e+00 1.149959e+01
acceleration 1.055157e+00 9.546478e-01 1.172429e+00
agility 9.939683e-01 8.752886e-01 1.130600e+00
balance 9.844026e-01 9.138398e-01 1.060982e+00
ball_control 8.345240e-01 6.997426e-01 9.724210e-01
composure 1.063342e+00 9.131579e-01 1.254795e+00
dribbling 1.001927e+00 8.685974e-01 1.160279e+00
finishing 1.029891e+00 9.745805e-01 1.091948e+00
vision 1.046560e+00 9.654337e-01 1.147153e+00
stamina 1.163133e+00 1.064369e+00 1.290224e+00
strength 9.816962e-01 8.932950e-01 1.072122e+00
The odds ratio (OR) values make interpretation intuitive:
Age: 3.25 → For each additional year (within the observed range), the odds of being elite multiply by 3.25, suggesting that established experience correlates strongly with elite rating.
Potential: 6.06 → For each unit increase in potential, elite odds roughly sextuple — the single strongest driver.
Stamina: 1.16 → Players with higher stamina are modestly more likely to be elite.
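Ball_control: 0.83 → Because OR < 1, each additional point of ball control multiplies the elite odds by about 0.83 (a reduction of roughly 17%) when the other predictors are held constant. This counterintuitive direction again hints at interplay among correlated skill variables.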
Together, these results indicate that, in this model, elite status is associated mainly with age, high potential, and physical endurance, while the individual technical attributes add little once those factors are accounted for.
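The odds ratios are the exponentials of the logistic coefficients, which can be verified directly. The sketch below uses the rounded coefficients from the summary output, so the results agree with the table only to about three significant figures. The five-point stamina comparison is an added illustration of how odds ratios compound multiplicatively.

```python
import math

# Coefficients (log-odds scale) copied from the summary output.
coefficients = {
    "age": 1.179,
    "potential": 1.801,
    "ball_control": -0.1809,
    "stamina": 0.1511,
}

# Odds ratio for a one-unit increase in each predictor.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

# A five-point increase multiplies the odds by the one-unit OR five times over.
stamina_5pt = odds_ratios["stamina"] ** 5

for name, ratio in odds_ratios.items():
    print(f"{name:<13} OR = {ratio:.3f}")
print(f"stamina, +5 points: OR = {stamina_5pt:.2f}")
```

On these figures, a five-point stamina advantage roughly doubles the odds of being elite, even though the one-point odds ratio of 1.16 looks modest.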
Accuracy
Training Testing
0.9728959 0.9632107
The logistic model achieved training accuracy = 0.973 and testing accuracy = 0.963. The testing accuracy matches the tree's result exactly (0.9632107), while the training accuracy is marginally lower (0.973 versus 0.977).
The small gap between training and testing accuracy indicates good generalisability and little overfitting. Logistic regression thus provides a reproducible, interpretable model of elite status, in which each attribute's effect is linear on the log-odds scale by construction.
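Accuracy is easier to judge against a majority-class baseline, which can be estimated from the null deviance in the logistic output. For an intercept-only binomial model with k elite players among n, the null deviance is -2[k ln(k/n) + (n - k) ln(1 - k/n)]. The sketch below searches for the k that matches the reported 432.933 with n = 701, assuming elite players are the minority class; the derived baseline is an inference from the printed output, not a figure reported in the source.

```python
import math


def null_deviance(k, n):
    """Deviance of an intercept-only binomial model with k positives in n cases."""
    p = k / n
    return -2 * (k * math.log(p) + (n - k) * math.log(1 - p))


N_TRAIN = 701            # null deviance degrees of freedom (700) + 1
REPORTED_NULL = 432.933  # from the summary output
REPORTED_RESIDUAL = 92.784

# Assumption: elite players are the minority, so search k up to n / 2.
elite_count = min(
    range(1, N_TRAIN // 2 + 1),
    key=lambda k: abs(null_deviance(k, N_TRAIN) - REPORTED_NULL),
)
baseline_accuracy = 1 - elite_count / N_TRAIN  # always predicting "N"

# McFadden's pseudo R-squared: share of null deviance explained by the model.
pseudo_r2 = 1 - REPORTED_RESIDUAL / REPORTED_NULL

print(elite_count, round(baseline_accuracy, 4), round(pseudo_r2, 3))
```

If this reading is right, about 9% of the training players are elite, so a model that always predicts non-elite would already score about 0.907. The reported accuracies of 0.96 to 0.98 are a real improvement over that baseline, but a smaller one than the raw figures suggest.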
Interpretation
Significant positive coefficients (p < 0.05), here age, potential, and stamina, raise the odds of being Elite.
The only significant negative coefficient, ball control, lowers those odds once the other predictors are held constant.
Comparison of Models
| Aspect | Classification Tree | Logistic Regression |
|---|---|---|
| Interpretability | Clear decision rules | Statistical inference (coefficients, p-values, odds ratios) |
| Training Accuracy | 0.977 | 0.973 |
| Testing Accuracy | 0.963 | 0.963 |
| Key Predictors | Age, potential, composure (highest importance) | Age, potential, stamina (positive); ball control (negative) |
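The tree offers visual insight into complex interactions, while logistic regression provides coefficient-level inference. On the test set the two models perform identically (0.963), and the logistic model shows a slightly smaller gap between training and testing accuracy.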
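The comparison of generalisation can be made concrete by computing the gap between training and testing accuracy for each model. The sketch below uses the accuracies reported in the two accuracy sections; the dictionary layout and names are illustrative.

```python
# (training accuracy, testing accuracy) as reported for each model.
accuracies = {
    "tree": (0.9771755, 0.9632107),
    "logistic": (0.9728959, 0.9632107),
}

# A larger positive gap means the model fits the training data better
# than it predicts unseen data.
gaps = {model: train - test for model, (train, test) in accuracies.items()}

for model, gap in gaps.items():
    print(f"{model:<9} gap = {gap:.4f}")
```

Both gaps are around one percentage point (about 1.4 for the tree and 1.0 for the logistic model), so the difference in generalisation between the two models is small.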
Conclusion
Both modelling approaches tell a similar story: age and potential are the strongest predictors of FIFA elite status, with composure, finishing, and vision (in the tree) and stamina (in the logistic model) adding further signal. The classification tree highlights straightforward decision rules that coaches can use to identify elite candidates, while logistic regression quantifies the direction and significance of each metric.
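The high accuracy rates (about 0.96 on the test set for both models) indicate that the available attributes capture elite quality effectively, although accuracy alone can flatter a model when elite players are a small minority of the sample.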
Although both models achieve the same testing accuracy, logistic regression shows a slightly smaller gap between training and testing accuracy and offers clearer statistical justification, making it the preferred method for predictive applications.
These results reinforce that top footballers distinguish themselves through a blend of refined skill, endurance, and mental control, rather than isolated physical attributes alone.