In the world of football, success on the pitch is often synonymous with star-studded lineups and hefty player price tags. But what if building a winning team wasn’t just about breaking the bank? This project embarks on a data-driven exploration of FIFA 2017 player stats, challenging the conventional approach by identifying the most cost-effective players to form a championship-worthy squad.
Leveraging the power of R and the finesse of Linear Regression, we’ll dive into key performance metrics— dribbling, passing, defense, and more—to assess their influence on a player’s market value. Our analysis goes beyond mere numbers, seeking to uncover patterns that balance performance with budget constraints. The goal? To create a statistically optimized dream team, one that maximizes talent while minimizing costs, showcasing the hidden potential within the world of football economics.
This project will provide insights not only for statisticians and data enthusiasts but also for football managers and scouts looking to extract value from the beautiful game. By transforming raw data into actionable intelligence, we aim to redefine what it means to build the best team—not just in terms of skills, but also through a financially strategic lens. Welcome to the future of football analytics, where statistics meets strategy, and value is redefined.
To have a basic idea regarding the data-set we are working with, we shall look at its basic attributes
[1] "The basic datatypes including the name of the variables are as follows:-"
Rows: 17,560
Columns: 63
$ ID <int> 176580, 178518, 181872, 197445, 195864, 17373…
$ Name <chr> " L. Suárez", " R. Nainggolan", " A. Vidal", …
$ Age <int> 29, 28, 29, 24, 23, 26, 25, 31, 30, 31, 30, 3…
$ Photo <chr> "https://cdn.sofifa.com/players/176/580/17_60…
$ Nationality <chr> "Uruguay", "Belgium", "Chile", "Austria", "Fr…
$ Flag <chr> "https://cdn.sofifa.com/flags/uy.png", "https…
$ Overall <int> 92, 86, 87, 86, 88, 90, 82, 83, 89, 94, 86, 8…
$ Potential <int> 92, 86, 87, 89, 94, 90, 85, 83, 89, 94, 86, 8…
$ Club <chr> "FC Barcelona", "Roma", "FC Bayern München", …
$ Club.Logo <chr> "https://cdn.sofifa.com/teams/241/light_30.pn…
$ Value <chr> "€83M", "€37.5M", "€41.5M", "€41.5M", "€71.5M…
$ Wage <chr> "€525K", "€130K", "€180K", "€140K", "€225K", …
$ Special <int> 2291, 2290, 2285, 2279, 2271, 2266, 2264, 224…
$ Preferred.Foot <chr> "Right", "Right", "Right", "Left", "Right", "…
$ International.Reputation <dbl> 5, 3, 4, 4, 4, 4, 3, 3, 4, 5, 3, 3, 3, 4, 3, …
$ Weak.Foot <dbl> 4, 3, 4, 4, 4, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, …
$ Skill.Moves <dbl> 4, 3, 3, 3, 5, 4, 3, 3, 4, 5, 4, 3, 3, 3, 3, …
$ Work.Rate <chr> "High/ Medium", "High/ High", "High/ High", "…
$ Body.Type <chr> "Normal (170-185)", "Stocky (170-185)", "Norm…
$ Real.Face <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "No"…
$ Position <chr> "<span class=\"pos pos25\">ST", "<span class=…
$ Jersey.Number <dbl> 9, 4, 23, 27, 6, 11, 24, 3, 19, 7, 8, 33, 18,…
$ Joined <chr> "Jul 11, 2014", "Jan 7, 2014", "Jul 28, 2015"…
$ Loaned.From <chr> "nan", "nan", "nan", "nan", "nan", "nan", "na…
$ Contract.Valid.Until <chr> "2021", "2021", "2019", "2021", "2021", "2022…
$ Height <chr> "6'0", "5'9", "5'11", "5'11", "6'3", "6'0", "…
$ Weight <chr> "190lbs", "143lbs", "165lbs", "168lbs", "185l…
$ Crossing <dbl> 77, 73, 76, 82, 78, 87, 77, 88, 78, 84, 76, 8…
$ Finishing <dbl> 94, 76, 77, 63, 71, 87, 76, 70, 71, 93, 63, 6…
$ HeadingAccuracy <dbl> 77, 59, 81, 75, 73, 86, 70, 74, 55, 85, 68, 6…
$ ShortPassing <dbl> 83, 84, 84, 83, 85, 86, 82, 81, 92, 83, 87, 8…
$ Volleys <dbl> 88, 75, 78, 68, 84, 76, 87, 63, 74, 88, 76, 7…
$ Dribbling <dbl> 86, 80, 76, 79, 89, 89, 79, 78, 86, 92, 84, 7…
$ Curve <dbl> 86, 73, 76, 78, 84, 86, 73, 81, 79, 81, 78, 8…
$ FKAccuracy <dbl> 84, 68, 68, 83, 82, 85, 79, 82, 77, 76, 78, 8…
$ LongPassing <dbl> 64, 81, 82, 80, 88, 80, 75, 74, 83, 77, 85, 8…
$ BallControl <dbl> 91, 85, 83, 83, 90, 88, 83, 79, 92, 93, 85, 8…
$ Acceleration <dbl> 88, 81, 77, 86, 75, 93, 86, 76, 77, 91, 79, 6…
$ SprintSpeed <dbl> 77, 78, 74, 86, 79, 95, 84, 75, 71, 92, 79, 8…
$ Agility <dbl> 86, 81, 75, 82, 79, 77, 87, 75, 93, 90, 81, 7…
$ Reactions <dbl> 93, 87, 89, 84, 86, 87, 79, 80, 88, 96, 86, 7…
$ Balance <dbl> 60, 84, 77, 77, 61, 65, 84, 85, 94, 63, 79, 6…
$ ShotPower <dbl> 87, 84, 84, 83, 90, 91, 82, 84, 73, 92, 82, 8…
$ Jumping <dbl> 69, 76, 82, 82, 85, 85, 74, 72, 67, 95, 67, 6…
$ Stamina <dbl> 89, 94, 93, 87, 91, 78, 92, 86, 83, 92, 83, 8…
$ Strength <dbl> 80, 78, 79, 69, 91, 80, 68, 69, 58, 80, 71, 7…
$ LongShots <dbl> 86, 86, 85, 83, 87, 90, 82, 75, 82, 90, 82, 7…
$ Aggression <dbl> 78, 88, 91, 69, 72, 65, 79, 74, 62, 63, 78, 8…
$ Interceptions <dbl> 41, 86, 89, 85, 70, 59, 76, 80, 76, 29, 82, 8…
$ Positioning <dbl> 92, 88, 80, 77, 84, 86, 82, 75, 79, 94, 82, 7…
$ Vision <dbl> 84, 76, 80, 78, 87, 79, 75, 76, 90, 85, 84, 8…
$ Penalties <dbl> 85, 63, 84, 80, 76, 76, 76, 90, 80, 85, 78, 8…
$ Composure <dbl> 83, 85, 86, 79, 83, 85, 79, 81, 76, 86, 84, 8…
$ Marking <dbl> 30, 78, 76, 81, 68, 51, 73, 81, 66, 22, 72, 7…
$ StandingTackle <dbl> 45, 85, 89, 83, 73, 55, 79, 82, 80, 31, 78, 7…
$ SlidingTackle <dbl> 38, 88, 84, 83, 73, 52, 77, 82, 73, 23, 78, 7…
$ GKDiving <dbl> 27, 11, 4, 5, 5, 15, 9, 8, 13, 7, 4, 10, 13, …
$ GKHandling <dbl> 25, 11, 2, 7, 6, 15, 8, 15, 9, 11, 3, 10, 15,…
$ GKKicking <dbl> 31, 14, 4, 14, 2, 11, 10, 10, 7, 15, 4, 6, 13…
$ GKPositioning <dbl> 33, 8, 2, 15, 4, 5, 11, 13, 14, 14, 2, 6, 6, …
$ GKReflexes <dbl> 37, 11, 4, 9, 3, 6, 10, 12, 9, 11, 4, 12, 13,…
$ Best.Position <chr> "ST", "CDM", "CDM", "LB", "CAM", "ST", "RB", …
$ Best.Overall.Rating <dbl> 88, 84, 85, 84, 85, 88, 81, 81, 86, 91, 84, 8…
In this project our idea is to form the most cost-effective yet high potential team using Statistics.
Lets drop the columns not required in our Analysis i.e “Loaned From”
We shall deal with NULL/NA values in this data-set constructively as possible. The variable(s) that contains NULL / NA values are :-
[1] "Jersey.Number"
This column is not useful, so we choose to drop it. Hence, our data is free from NULL/NA Values.
Now in order to proceed we shall change the type of few columns : Wage, Value and Best Position, because they will be used in calculation and hence should go through some feature engineering to remove the unwanted parts.
We created a column “Category” that has 4 position, namely “Attacker”, “Defender”, “Midfielder”, “Goalkeeper” based on the Best.Position column.
Here is the distribution of the categories :-
Attacker Defender Goalkeeper Midfielder
3028 5913 1995 6624
For the Wage and Value column, we need to remove the € symbol and also convert K and M with suitable multiples.
Next, we shall convert the Height and Weight columns respectively to their numeric values
On basic scrutiny or birds eye view of the data-set confirms that only the numerical column(both original and newly formed) are required for the analysis, rest all columns other than those and Name shall be dropped from the data-set.
Our data is now ready for analysis, we shall begin with some visual analysis that includes basic plots and further we shall also try to check whether we can categories or form cluster based on the category column using PCA.
Here, we shall see whether using the numerical columns only, can we cluster the data-set into the categories we just formed.
We are using 3 Principal Components.
We shall color code the category column as :
• Attacker= “RED”
• Midfielder= “GREEN”
• Defender= “BLUE”
• Goalkeeper= “PURPLE”, This will help us to notice that whether clusters are formed if the dimension is reduced to 3.
Here is the PCA Plot using 3 Components :-
.
Clearly we can see that there are clusters in the data-set which synchronizes with the category column.
The goal of this project is to uncover hidden gems in football—players who are both undervalued and underpaid—using a data-driven approach. By leveraging linear regression models, we aim to identify those athletes whose market value and wages don’t reflect their true potential, creating an opportunity to build a cost-effective dream team.
• The first step involves fitting a linear model with player market value as the target variable, utilizing key player attributes like Overall, Potential, Attacking, and more. Through an analysis of the residuals, we will identify players with negative residuals, meaning their actual market value is lower than what the model predicts—signaling undervalued players.
• Once this group of undervalued players is identified, we will apply a second linear regression model focusing on player wages. By examining the wage residuals, we will pinpoint players who are also underpaid, based on the model’s predictions.
In the end, this dual residual analysis allows us to construct a moneyball-inspired team, comprised of players who offer maximum performance for minimal cost, taking advantage of market inefficiencies and optimizing resources to build a winning squad.
The dimension of this dataset is very high, we shall choose to drop
some columns which are not necessary using LASSO.
The columns retained are : ID, Name, International.Reputation, HeadingAccuracy, Skill.Moves, Acceleration, SprintSpeed, Reactions, Aggression, Marking, GKHandling, GKReflexes, Wage, Value, Category
We will proceed to fit a Linear Model and hence shall check the assumptions of the Linear Model to hold as practical as possible. The Assumptions are :-
• Linear Relationship
• No Heteroscedasticity
• Normality of residuals
• No Collinearity
For simplicity we have assumed Linear Relationship in the structure, to check for heteroscedasticity of residuals we shall plot the residuals and search for possible patterns.
The plot of the residual :-
Clearly we can see some patterns which doesnot form a band
around 0 in the first plot. After LogTransformation on the response we
get a better structure for the residuals.
Though the plot formed a band about 0, but we can clearly observe some disturbances in the initial part of x-axis. Those values might be due to the presence of influential outliers , we shall plot an Influence Plot to check and remove if those outliers are influential.
The corresponding Influence Plot based on Cooks Distance is :
-
The Top 10 Influential Outlier are :-
Name Value
1 Neymar 1.06e+08
2 L. Messi 8.90e+07
3 Cristiano Ronaldo 8.70e+07
4 L. Suárez 8.30e+07
5 G. Bale 7.20e+07
6 P. Pogba 7.15e+07
7 E. Hazard 7.15e+07
8 R. Lewandowski 7.10e+07
9 M. Neuer 6.95e+07
10 De Gea 6.85e+07
I guess we know why they are Influential and
Outlier?
We shall remove the influential outliers and plot the residual plot
again.
Indeed, the plot is better now as it forms a clear band about 0,
hence there is no heteroscedasticity or collinearity in between the
residual terms.
Now, in this case lets check for Normality assumption of
residuals A QQPlot of residuals :-
It absolutely satisfies this condition.
Now we shall check for multicollinearity in the columns by calculating the VIF and putting a threshold at 10.
The barplot showing the VIF values :-
GKReflexes and GKHandling are showing high VIF
values.
The correlation between GKReflexes and GKHandling is 0.9685649
As the correlation of them is very high we shall choose to drop one, in this case we are dropping GKReflexes as the model suggested GKHandling to be significant but not GKReflexes.
Call:
lm(formula = log(Value) ~ . - Name - ID - GKReflexes, data = df2)
Residuals:
Min 1Q Median 3Q Max
-2.47202 -0.40902 -0.00122 0.41027 2.20622
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.400e+00 6.427e-02 84.028 < 2e-16 ***
International.Reputation -1.258e-02 1.904e-02 -0.661 0.509
HeadingAccuracy 2.264e-02 6.151e-04 36.813 < 2e-16 ***
Skill.Moves 2.644e-01 1.062e-02 24.888 < 2e-16 ***
Acceleration 7.982e-03 9.047e-04 8.823 < 2e-16 ***
SprintSpeed 1.341e-02 9.002e-04 14.897 < 2e-16 ***
Reactions 6.041e-02 7.451e-04 81.078 < 2e-16 ***
Aggression 3.138e-03 4.739e-04 6.621 3.68e-11 ***
Marking 3.340e-03 5.067e-04 6.591 4.49e-11 ***
GKHandling 3.149e-02 1.231e-03 25.576 < 2e-16 ***
Wage 2.291e-05 4.253e-07 53.866 < 2e-16 ***
CategoryDefender -4.583e-03 2.350e-02 -0.195 0.845
CategoryGoalkeeper 4.276e-01 7.504e-02 5.698 1.23e-08 ***
CategoryMidfielder 1.700e-01 1.930e-02 8.812 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6253 on 17079 degrees of freedom
Multiple R-squared: 0.7583, Adjusted R-squared: 0.7581
F-statistic: 4121 on 13 and 17079 DF, p-value: < 2.2e-16
we shall insert the residuals based on this model in the data set and filter the data based on negative residuals as mentioned above , this filtering will give us the names of players who are under valued.
The current dataset has 8559 rows
So, we have the data set containing under-valued players and now proceed to pinpoint players who are also underpaid.
In order to do so, we are going to fit another linear model but this time the response variable will be log(Wage).
After fitting the model on this new dataset, we shall once again filter the dataset based on the negative residual values like the previous case.
The dataset now has 4304 rows
Hence, we have the dataset containing the list of players who are both Undervalued and Underpaid.
From the initial dataframe we need two columns to define a metric for cost efficiency. The columns are namely : Potential and Overall.
In order to do so , we shall merge using LEFT JOIN, on the ID column as it is unique in nature and proceed with the merged dataset to form some metric.
The cost-effectiveness metric is designed to evaluate the performance of football players relative to their cost, with the goal of identifying the most cost-effective players for building a team. The key columns required are:
• Overall: Represents the player’s current performance level.
• Potential: Indicates the player’s possible future growth or peak performance.
• Wage: The player’s weekly or yearly salary, representing the financial commitment.
• Value: The market value or transfer fee of the player.
The basic cost-effectiveness metric is formulated as:
\[ \text{Cost-Effectiveness} = \frac{\text{Overall} + \text{Potential}}{\text{Wage} + \text{Value}} \]
• Balance of Performance and Cost: This metric helps balance the current performance (Overall) and future potential (Potential) of players against the financial investment required to acquire and retain them (Wage + Value). This is crucial for assembling a team that is both high-performing and affordable, similar to the “Moneyball” philosophy.
• Focus on Value for Money: By using this metric, you can avoid overpaying for high-profile players who may not deliver performance proportional to their cost. Instead, you identify undervalued players who provide a higher return on investment in terms of their performance on the field.
• Player Selection Optimization: This metric allows for easy comparison between players of different cost and performance levels. It enables decision-making that prioritizes overall team strength while adhering to budget constraints.
While the basic metric is valuable, it can sometimes overemphasize cheaper players who may not be good enough to contribute significantly, as you observed. To address this, we developed a Weighted Cost-Effectiveness Metric :
\[ \text{Cost-Effectiveness (Weighted)} = \frac{2 \times \text{Overall} + \text{Potential}}{\text{Wage} + \text{Value}} \]
• Greater Emphasis on Current Performance: By giving more weight to the Overall score, the metric ensures that players who can contribute immediately are prioritized. This adjustment is necessary because while potential is important, players who perform well right now are more valuable for immediate results.
• Balancing Current and Future Potential: This refined metric still considers the future potential of players, but it ensures that the team isn’t overly reliant on future prospects who may not deliver immediate value.
• Avoiding the “Cheap but Ineffective” Trap: The weighted metric helps avoid a scenario where the cheapest players with low performance dominate the selection. It ensures that lower-cost players still meet a minimum performance standard, making it more aligned with the “Moneyball” approach, where the goal is to win, not just to save money.
After forming this metric and attaching this as a column in the merged file, we shall now split the dataset based on the Category Column.
[1] "Attacker" "Defender" "Goalkeeper" "Midfielder"
We have 4 datasets based on 4 positions of players :** Attacker, Midfielder, Defender and Goalkeeper.**
Finally in order to choose players we shall run a final filter on these datasets so that we can have players that are cheaper in cost but valuable in play. We shall filter the datasets for Potential values >80 and Overall values >80 and the sort the dataset based on Cost-Effectiveness_Metric column in decending order.
Selecting by Cost_Effectiveness_Metric
Selecting by Cost_Effectiveness_Metric
Selecting by Cost_Effectiveness_Metric
Selecting by Cost_Effectiveness_Metric
Here is the list of Potential Players that can be used to make the Best Cost-Effective Team.
Name Overall Potential Best.Position
1 Rubén Castro 81 81 ST
2 Aduriz 84 84 ST
3 A. Gignac 81 81 ST
4 J. Vardy 81 81 ST
5 N. Kalinić 81 81 ST
6 W. Rooney 82 82 ST
7 Falcao 82 82 ST
8 J. Hernández 82 82 ST
9 I. Slimani 82 82 ST
10 M. Mandžukić 83 83 ST
Name Overall Potential Best.Position
1 D. De Rossi 83 83 CDM
2 M. Parolo 81 81 CM
3 G. Medel 82 82 CDM
4 D. Blind 81 82 CDM
5 Quaresma 83 83 CAM
6 João Moutinho 82 82 CM
7 M. Dembélé 82 82 CM
8 A. Dzagoev 81 82 CM
9 M. Arnautović 82 82 RM
10 G. Wijnaldum 81 84 CAM
Name Overall Potential Best.Position
1 P. Evra 82 82 CB
2 Naldo 83 83 CB
3 Dani Alves 84 84 RWB
4 L. Piszczek 83 83 RB
5 L. Baines 83 83 LWB
6 A. Williams 83 83 CB
7 G. Cahill 83 83 CB
8 Bartra 81 83 CB
9 J. Vertonghen 83 83 CB
10 Azpilicueta 83 83 CB
Name Overall Potential Best.Position
1 José Reina 83 83 GK
2 A. Begović 83 83 GK
According to survey the most famous football team strategy back in 2017 was 4-2-3-1. This setup is flexible, providing solid defense and midfield control, with plenty of attacking options.
Team Formation:
Goalkeeper:
• José Reina
Defenders:
• CB: Naldo
• CB: Azpilicueta
• LB: L. Baines
• RB: L. Piszczek
Defensive Midfielders:
• CDM:D. De Rossi – strong defensive midfielder to break up play
• CDM:G. Medel – tough tackler and good at distributing the ball from deep
Attacking Midfielders:
• CAM: Quaresma – playmaker to provide creativity and flair in the attack
• LM: W. Rooney – wide playmaker, cutting inside to take shots and create chances
• RM: G. Wijnaldum – dynamic midfielder to link up with the striker and attack
Striker (ST):
• Aduriz – central striker leading the attack, good at finishing and holding the ball up
Inspired by the iconic “Moneyball” approach, this project has successfully uncovered a roster of 32 players—10 attackers, 10 midfielders, 10 defenders, and 2 goalkeepers—who offer the best blend of performance and value. By leveraging a cost-effectiveness metric that prioritizes both immediate impact and future potential, we’ve built a team that defies traditional scouting norms, proving that you don’t need to break the bank to build a winning squad.
Much like in “Moneyball”, the goal wasn’t just to find stars but to assemble a team that can deliver results while staying within financial constraints. This data-driven strategy reveals that the path to victory lies not in chasing high-priced talent but in identifying the right combination of skill, potential, and cost-efficiency.
This squad is ready to win, not just on the field, but in the financial books as well—redefining what it means to be truly “cost-effective”.