Introduction

In the world of football, success on the pitch is often synonymous with star-studded lineups and hefty player price tags. But what if building a winning team wasn’t just about breaking the bank? This project embarks on a data-driven exploration of FIFA 2017 player stats, challenging the conventional approach by identifying the most cost-effective players to form a championship-worthy squad.

Leveraging the power of R and the finesse of Linear Regression, we’ll dive into key performance metrics— dribbling, passing, defense, and more—to assess their influence on a player’s market value. Our analysis goes beyond mere numbers, seeking to uncover patterns that balance performance with budget constraints. The goal? To create a statistically optimized dream team, one that maximizes talent while minimizing costs, showcasing the hidden potential within the world of football economics.

This project will provide insights not only for statisticians and data enthusiasts but also for football managers and scouts looking to extract value from the beautiful game. By transforming raw data into actionable intelligence, we aim to redefine what it means to build the best team—not just in terms of skills, but also through a financially strategic lens. Welcome to the future of football analytics, where statistics meets strategy, and value is redefined.

Exploratory Data Analysis

To have a basic idea regarding the data-set we are working with, we shall look at its basic attributes

[1] "The basic datatypes including the name of the variables are as follows:-"

Rows: 17,560
Columns: 63
$ ID                       <int> 176580, 178518, 181872, 197445, 195864, 17373…
$ Name                     <chr> " L. Suárez", " R. Nainggolan", " A. Vidal", …
$ Age                      <int> 29, 28, 29, 24, 23, 26, 25, 31, 30, 31, 30, 3…
$ Photo                    <chr> "https://cdn.sofifa.com/players/176/580/17_60…
$ Nationality              <chr> "Uruguay", "Belgium", "Chile", "Austria", "Fr…
$ Flag                     <chr> "https://cdn.sofifa.com/flags/uy.png", "https…
$ Overall                  <int> 92, 86, 87, 86, 88, 90, 82, 83, 89, 94, 86, 8…
$ Potential                <int> 92, 86, 87, 89, 94, 90, 85, 83, 89, 94, 86, 8…
$ Club                     <chr> "FC Barcelona", "Roma", "FC Bayern München", …
$ Club.Logo                <chr> "https://cdn.sofifa.com/teams/241/light_30.pn…
$ Value                    <chr> "€83M", "€37.5M", "€41.5M", "€41.5M", "€71.5M…
$ Wage                     <chr> "€525K", "€130K", "€180K", "€140K", "€225K", …
$ Special                  <int> 2291, 2290, 2285, 2279, 2271, 2266, 2264, 224…
$ Preferred.Foot           <chr> "Right", "Right", "Right", "Left", "Right", "…
$ International.Reputation <dbl> 5, 3, 4, 4, 4, 4, 3, 3, 4, 5, 3, 3, 3, 4, 3, …
$ Weak.Foot                <dbl> 4, 3, 4, 4, 4, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, …
$ Skill.Moves              <dbl> 4, 3, 3, 3, 5, 4, 3, 3, 4, 5, 4, 3, 3, 3, 3, …
$ Work.Rate                <chr> "High/ Medium", "High/ High", "High/ High", "…
$ Body.Type                <chr> "Normal (170-185)", "Stocky (170-185)", "Norm…
$ Real.Face                <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "No"…
$ Position                 <chr> "<span class=\"pos pos25\">ST", "<span class=…
$ Jersey.Number            <dbl> 9, 4, 23, 27, 6, 11, 24, 3, 19, 7, 8, 33, 18,…
$ Joined                   <chr> "Jul 11, 2014", "Jan 7, 2014", "Jul 28, 2015"…
$ Loaned.From              <chr> "nan", "nan", "nan", "nan", "nan", "nan", "na…
$ Contract.Valid.Until     <chr> "2021", "2021", "2019", "2021", "2021", "2022…
$ Height                   <chr> "6'0", "5'9", "5'11", "5'11", "6'3", "6'0", "…
$ Weight                   <chr> "190lbs", "143lbs", "165lbs", "168lbs", "185l…
$ Crossing                 <dbl> 77, 73, 76, 82, 78, 87, 77, 88, 78, 84, 76, 8…
$ Finishing                <dbl> 94, 76, 77, 63, 71, 87, 76, 70, 71, 93, 63, 6…
$ HeadingAccuracy          <dbl> 77, 59, 81, 75, 73, 86, 70, 74, 55, 85, 68, 6…
$ ShortPassing             <dbl> 83, 84, 84, 83, 85, 86, 82, 81, 92, 83, 87, 8…
$ Volleys                  <dbl> 88, 75, 78, 68, 84, 76, 87, 63, 74, 88, 76, 7…
$ Dribbling                <dbl> 86, 80, 76, 79, 89, 89, 79, 78, 86, 92, 84, 7…
$ Curve                    <dbl> 86, 73, 76, 78, 84, 86, 73, 81, 79, 81, 78, 8…
$ FKAccuracy               <dbl> 84, 68, 68, 83, 82, 85, 79, 82, 77, 76, 78, 8…
$ LongPassing              <dbl> 64, 81, 82, 80, 88, 80, 75, 74, 83, 77, 85, 8…
$ BallControl              <dbl> 91, 85, 83, 83, 90, 88, 83, 79, 92, 93, 85, 8…
$ Acceleration             <dbl> 88, 81, 77, 86, 75, 93, 86, 76, 77, 91, 79, 6…
$ SprintSpeed              <dbl> 77, 78, 74, 86, 79, 95, 84, 75, 71, 92, 79, 8…
$ Agility                  <dbl> 86, 81, 75, 82, 79, 77, 87, 75, 93, 90, 81, 7…
$ Reactions                <dbl> 93, 87, 89, 84, 86, 87, 79, 80, 88, 96, 86, 7…
$ Balance                  <dbl> 60, 84, 77, 77, 61, 65, 84, 85, 94, 63, 79, 6…
$ ShotPower                <dbl> 87, 84, 84, 83, 90, 91, 82, 84, 73, 92, 82, 8…
$ Jumping                  <dbl> 69, 76, 82, 82, 85, 85, 74, 72, 67, 95, 67, 6…
$ Stamina                  <dbl> 89, 94, 93, 87, 91, 78, 92, 86, 83, 92, 83, 8…
$ Strength                 <dbl> 80, 78, 79, 69, 91, 80, 68, 69, 58, 80, 71, 7…
$ LongShots                <dbl> 86, 86, 85, 83, 87, 90, 82, 75, 82, 90, 82, 7…
$ Aggression               <dbl> 78, 88, 91, 69, 72, 65, 79, 74, 62, 63, 78, 8…
$ Interceptions            <dbl> 41, 86, 89, 85, 70, 59, 76, 80, 76, 29, 82, 8…
$ Positioning              <dbl> 92, 88, 80, 77, 84, 86, 82, 75, 79, 94, 82, 7…
$ Vision                   <dbl> 84, 76, 80, 78, 87, 79, 75, 76, 90, 85, 84, 8…
$ Penalties                <dbl> 85, 63, 84, 80, 76, 76, 76, 90, 80, 85, 78, 8…
$ Composure                <dbl> 83, 85, 86, 79, 83, 85, 79, 81, 76, 86, 84, 8…
$ Marking                  <dbl> 30, 78, 76, 81, 68, 51, 73, 81, 66, 22, 72, 7…
$ StandingTackle           <dbl> 45, 85, 89, 83, 73, 55, 79, 82, 80, 31, 78, 7…
$ SlidingTackle            <dbl> 38, 88, 84, 83, 73, 52, 77, 82, 73, 23, 78, 7…
$ GKDiving                 <dbl> 27, 11, 4, 5, 5, 15, 9, 8, 13, 7, 4, 10, 13, …
$ GKHandling               <dbl> 25, 11, 2, 7, 6, 15, 8, 15, 9, 11, 3, 10, 15,…
$ GKKicking                <dbl> 31, 14, 4, 14, 2, 11, 10, 10, 7, 15, 4, 6, 13…
$ GKPositioning            <dbl> 33, 8, 2, 15, 4, 5, 11, 13, 14, 14, 2, 6, 6, …
$ GKReflexes               <dbl> 37, 11, 4, 9, 3, 6, 10, 12, 9, 11, 4, 12, 13,…
$ Best.Position            <chr> "ST", "CDM", "CDM", "LB", "CAM", "ST", "RB", …
$ Best.Overall.Rating      <dbl> 88, 84, 85, 84, 85, 88, 81, 81, 86, 91, 84, 8…

In this project our idea is to form the most cost-effective yet high potential team using Statistics.

Data Cleaning

Lets drop the columns not required in our Analysis i.e “Loaned From”

We shall deal with NULL/NA values in this data-set constructively as possible. The variable(s) that contains NULL / NA values are :-

[1] "Jersey.Number"

This column is not useful, so we choose to drop it. Hence, our data is free from NULL/NA Values.

Now in order to proceed we shall change the type of few columns : Wage, Value and Best Position, because they will be used in calculation and hence should go through some feature engineering to remove the unwanted parts.

We created a column “Category” that has 4 position, namely “Attacker”, “Defender”, “Midfielder”, “Goalkeeper” based on the Best.Position column.

Here is the distribution of the categories :-


  Attacker   Defender Goalkeeper Midfielder 
      3028       5913       1995       6624

For the Wage and Value column, we need to remove the € symbol and also convert K and M with suitable multiples.

Next, we shall convert the Height and Weight columns respectively to their numeric values

On basic scrutiny or birds eye view of the data-set confirms that only the numerical column(both original and newly formed) are required for the analysis, rest all columns other than those and Name shall be dropped from the data-set.

Our data is now ready for analysis, we shall begin with some visual analysis that includes basic plots and further we shall also try to check whether we can categories or form cluster based on the category column using PCA.

Principal Component Analysis

Here, we shall see whether using the numerical columns only, can we cluster the data-set into the categories we just formed.

Workflow :-

We are using 3 Principal Components.
We shall color code the category column as :

• Attacker= “RED”

• Midfielder= “GREEN”

• Defender= “BLUE”

• Goalkeeper= “PURPLE”, This will help us to notice that whether clusters are formed if the dimension is reduced to 3.

Here is the PCA Plot using 3 Components :-

3D PCA Plot .

Clearly we can see that there are clusters in the data-set which synchronizes with the category column.

Forming the best Cost-Effective Team

The goal of this project is to uncover hidden gems in football—players who are both undervalued and underpaid—using a data-driven approach. By leveraging linear regression models, we aim to identify those athletes whose market value and wages don’t reflect their true potential, creating an opportunity to build a cost-effective dream team.

• The first step involves fitting a linear model with player market value as the target variable, utilizing key player attributes like Overall, Potential, Attacking, and more. Through an analysis of the residuals, we will identify players with negative residuals, meaning their actual market value is lower than what the model predicts—signaling undervalued players.

• Once this group of undervalued players is identified, we will apply a second linear regression model focusing on player wages. By examining the wage residuals, we will pinpoint players who are also underpaid, based on the model’s predictions.

In the end, this dual residual analysis allows us to construct a moneyball-inspired team, comprised of players who offer maximum performance for minimal cost, taking advantage of market inefficiencies and optimizing resources to build a winning squad.

First Linear Model with Value as the target variable

Dimension Reduction

The dimension of this dataset is very high, we shall choose to drop some columns which are not necessary using LASSO.

The columns retained are : ID,  Name,  International.Reputation,  HeadingAccuracy,  Skill.Moves,  Acceleration,  SprintSpeed,  Reactions,  Aggression,  Marking,  GKHandling,  GKReflexes,  Wage,  Value,  Category

Linear Model Assumptions

We will proceed to fit a Linear Model and hence shall check the assumptions of the Linear Model to hold as practical as possible. The Assumptions are :-

• Linear Relationship

• No Heteroscedasticity

• Normality of residuals

• No Collinearity

For simplicity we have assumed Linear Relationship in the structure, to check for heteroscedasticity of residuals we shall plot the residuals and search for possible patterns.

The plot of the residual :-

Clearly we can see some patterns which doesnot form a band around 0 in the first plot. After LogTransformation on the response we get a better structure for the residuals.

Though the plot formed a band about 0, but we can clearly observe some disturbances in the initial part of x-axis. Those values might be due to the presence of influential outliers , we shall plot an Influence Plot to check and remove if those outliers are influential.

The corresponding Influence Plot based on Cooks Distance is : -

The Top 10 Influential Outlier are :-

                 Name    Value
1              Neymar 1.06e+08
2            L. Messi 8.90e+07
3   Cristiano Ronaldo 8.70e+07
4           L. Suárez 8.30e+07
5             G. Bale 7.20e+07
6            P. Pogba 7.15e+07
7           E. Hazard 7.15e+07
8      R. Lewandowski 7.10e+07
9            M. Neuer 6.95e+07
10             De Gea 6.85e+07

I guess we know why they are Influential and Outlier?

We shall remove the influential outliers and plot the residual plot again.

Indeed, the plot is better now as it forms a clear band about 0, hence there is no heteroscedasticity or collinearity in between the residual terms.

Now, in this case lets check for Normality assumption of residuals A QQPlot of residuals :-

It absolutely satisfies this condition.

Now we shall check for multicollinearity in the columns by calculating the VIF and putting a threshold at 10.

The barplot showing the VIF values :-

GKReflexes and GKHandling are showing high VIF values.

The correlation between GKReflexes and GKHandling is 0.9685649

As the correlation of them is very high we shall choose to drop one, in this case we are dropping GKReflexes as the model suggested GKHandling to be significant but not GKReflexes.

The Final Linear Model regarding the response variable Value


Call:
lm(formula = log(Value) ~ . - Name - ID - GKReflexes, data = df2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.47202 -0.40902 -0.00122  0.41027  2.20622 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               5.400e+00  6.427e-02  84.028  < 2e-16 ***
International.Reputation -1.258e-02  1.904e-02  -0.661    0.509    
HeadingAccuracy           2.264e-02  6.151e-04  36.813  < 2e-16 ***
Skill.Moves               2.644e-01  1.062e-02  24.888  < 2e-16 ***
Acceleration              7.982e-03  9.047e-04   8.823  < 2e-16 ***
SprintSpeed               1.341e-02  9.002e-04  14.897  < 2e-16 ***
Reactions                 6.041e-02  7.451e-04  81.078  < 2e-16 ***
Aggression                3.138e-03  4.739e-04   6.621 3.68e-11 ***
Marking                   3.340e-03  5.067e-04   6.591 4.49e-11 ***
GKHandling                3.149e-02  1.231e-03  25.576  < 2e-16 ***
Wage                      2.291e-05  4.253e-07  53.866  < 2e-16 ***
CategoryDefender         -4.583e-03  2.350e-02  -0.195    0.845    
CategoryGoalkeeper        4.276e-01  7.504e-02   5.698 1.23e-08 ***
CategoryMidfielder        1.700e-01  1.930e-02   8.812  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6253 on 17079 degrees of freedom
Multiple R-squared:  0.7583,    Adjusted R-squared:  0.7581 
F-statistic:  4121 on 13 and 17079 DF,  p-value: < 2.2e-16

we shall insert the residuals based on this model in the data set and filter the data based on negative residuals as mentioned above , this filtering will give us the names of players who are under valued.

The current dataset has  8559 rows

So, we have the data set containing under-valued players and now proceed to pinpoint players who are also underpaid.

In order to do so, we are going to fit another linear model but this time the response variable will be log(Wage).

After fitting the model on this new dataset, we shall once again filter the dataset based on the negative residual values like the previous case.

Linear model considering the Wage column as response

The dataset now has 4304 rows

Hence, we have the dataset containing the list of players who are both Undervalued and Underpaid.

From the initial dataframe we need two columns to define a metric for cost efficiency. The columns are namely : Potential and Overall.

In order to do so , we shall merge using LEFT JOIN, on the ID column as it is unique in nature and proceed with the merged dataset to form some metric.

Metric Design

Basic Metric

The cost-effectiveness metric is designed to evaluate the performance of football players relative to their cost, with the goal of identifying the most cost-effective players for building a team. The key columns required are:

• Overall: Represents the player’s current performance level.

• Potential: Indicates the player’s possible future growth or peak performance.

• Wage: The player’s weekly or yearly salary, representing the financial commitment.

• Value: The market value or transfer fee of the player.

The basic cost-effectiveness metric is formulated as:

\[ \text{Cost-Effectiveness} = \frac{\text{Overall} + \text{Potential}}{\text{Wage} + \text{Value}} \]

Why This Metric is Useful

• Balance of Performance and Cost: This metric helps balance the current performance (Overall) and future potential (Potential) of players against the financial investment required to acquire and retain them (Wage + Value). This is crucial for assembling a team that is both high-performing and affordable, similar to the “Moneyball” philosophy.

• Focus on Value for Money: By using this metric, you can avoid overpaying for high-profile players who may not deliver performance proportional to their cost. Instead, you identify undervalued players who provide a higher return on investment in terms of their performance on the field.

• Player Selection Optimization: This metric allows for easy comparison between players of different cost and performance levels. It enables decision-making that prioritizes overall team strength while adhering to budget constraints.

Weighted Metric

While the basic metric is valuable, it can sometimes overemphasize cheaper players who may not be good enough to contribute significantly, as you observed. To address this, we developed a Weighted Cost-Effectiveness Metric :

\[ \text{Cost-Effectiveness (Weighted)} = \frac{2 \times \text{Overall} + \text{Potential}}{\text{Wage} + \text{Value}} \]

Reason for the Weighted Metric

• Greater Emphasis on Current Performance: By giving more weight to the Overall score, the metric ensures that players who can contribute immediately are prioritized. This adjustment is necessary because while potential is important, players who perform well right now are more valuable for immediate results.

• Balancing Current and Future Potential: This refined metric still considers the future potential of players, but it ensures that the team isn’t overly reliant on future prospects who may not deliver immediate value.

• Avoiding the “Cheap but Ineffective” Trap: The weighted metric helps avoid a scenario where the cheapest players with low performance dominate the selection. It ensures that lower-cost players still meet a minimum performance standard, making it more aligned with the “Moneyball” approach, where the goal is to win, not just to save money.

After forming this metric and attaching this as a column in the merged file, we shall now split the dataset based on the Category Column.

[1] "Attacker"   "Defender"   "Goalkeeper" "Midfielder"

We have 4 datasets based on 4 positions of players :** Attacker, Midfielder, Defender and Goalkeeper.**

Finally in order to choose players we shall run a final filter on these datasets so that we can have players that are cheaper in cost but valuable in play. We shall filter the datasets for Potential values >80 and Overall values >80 and the sort the dataset based on Cost-Effectiveness_Metric column in decending order.

Selecting by Cost_Effectiveness_Metric
Selecting by Cost_Effectiveness_Metric
Selecting by Cost_Effectiveness_Metric
Selecting by Cost_Effectiveness_Metric

The Team

Here is the list of Potential Players that can be used to make the Best Cost-Effective Team.

Attackers : Top 10 Attackers

            Name Overall Potential Best.Position
1   Rubén Castro      81        81            ST
2         Aduriz      84        84            ST
3      A. Gignac      81        81            ST
4       J. Vardy      81        81            ST
5     N. Kalinić      81        81            ST
6      W. Rooney      82        82            ST
7         Falcao      82        82            ST
8   J. Hernández      82        82            ST
9     I. Slimani      82        82            ST
10  M. Mandžukić      83        83            ST

Midfielders : Top 10 Midfielders

             Name Overall Potential Best.Position
1     D. De Rossi      83        83           CDM
2       M. Parolo      81        81            CM
3        G. Medel      82        82           CDM
4        D. Blind      81        82           CDM
5        Quaresma      83        83           CAM
6   João Moutinho      82        82            CM
7      M. Dembélé      82        82            CM
8      A. Dzagoev      81        82            CM
9   M. Arnautović      82        82            RM
10   G. Wijnaldum      81        84           CAM

Defenders : Top 10 Defenders

             Name Overall Potential Best.Position
1         P. Evra      82        82            CB
2           Naldo      83        83            CB
3      Dani Alves      84        84           RWB
4     L. Piszczek      83        83            RB
5       L. Baines      83        83           LWB
6     A. Williams      83        83            CB
7       G. Cahill      83        83            CB
8          Bartra      81        83            CB
9   J. Vertonghen      83        83            CB
10    Azpilicueta      83        83            CB

Goalkeepers : Top 2 Goalkeepers

         Name Overall Potential Best.Position
1  José Reina      83        83            GK
2  A. Begović      83        83            GK

One Possible choice of the Team

According to survey the most famous football team strategy back in 2017 was 4-2-3-1. This setup is flexible, providing solid defense and midfield control, with plenty of attacking options.

Team Formation:

Goalkeeper:

• José Reina

Defenders:

• CB: Naldo

• CB: Azpilicueta

• LB: L. Baines

• RB: L. Piszczek

Defensive Midfielders:

• CDM:D. De Rossi – strong defensive midfielder to break up play

• CDM:G. Medel – tough tackler and good at distributing the ball from deep

Attacking Midfielders:

• CAM: Quaresma – playmaker to provide creativity and flair in the attack

• LM: W. Rooney – wide playmaker, cutting inside to take shots and create chances

• RM: G. Wijnaldum – dynamic midfielder to link up with the striker and attack

Striker (ST):

• Aduriz – central striker leading the attack, good at finishing and holding the ball up

The Visual Representation of the team on the Field :

The Best Cost-Effective Team

Conclusion

Inspired by the iconic “Moneyball” approach, this project has successfully uncovered a roster of 32 players—10 attackers, 10 midfielders, 10 defenders, and 2 goalkeepers—who offer the best blend of performance and value. By leveraging a cost-effectiveness metric that prioritizes both immediate impact and future potential, we’ve built a team that defies traditional scouting norms, proving that you don’t need to break the bank to build a winning squad.

Much like in “Moneyball”, the goal wasn’t just to find stars but to assemble a team that can deliver results while staying within financial constraints. This data-driven strategy reveals that the path to victory lies not in chasing high-priced talent but in identifying the right combination of skill, potential, and cost-efficiency.

This squad is ready to win, not just on the field, but in the financial books as well—redefining what it means to be truly “cost-effective”.

The Moneyball Paradigm : Crafting Football’s Most Efficient Team

Sayak Goswami

2024-10-24