Stage 3 Final Report: Football Player Wage Prediction
Author
Kutay Polat and Efe Colak
Published
June 5, 2026
1 Economic Question
The main economic question of this project is:
Which player characteristics best predict professional football players’ wages?
This question is economically important because professional football is a high-value labor market where wages reflect performance, experience, reputation, and market demand. Understanding which characteristics are most strongly associated with wages can help clubs evaluate players, negotiate contracts, and study how labor markets reward observable productivity.
2 Data
The dataset used in this report is SalaryPrediction.csv. It contains information on professional football players and their wages. The dependent variable is Wage, a continuous numeric variable representing player salary. The main explanatory variables used in the analysis are Age, Apps (club appearances), Caps (national team caps), Position, and League.
The dataset includes professional players with very different salary levels. This makes it useful for studying wage inequality and the economic rewards associated with experience and performance-related indicators.
3 Probability and Distribution Analysis
The dependent variable Wage is continuous. In professional sports, wages are usually highly unequal: most players earn moderate wages, while a small number of elite players earn extremely high wages.
# Histogram of raw wages (requires the ggplot2 package)
ggplot(salary_clean, aes(x = Wage)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Football Player Wages",
    x = "Wage",
    y = "Count"
  ) +
  theme_minimal()
The wage distribution is strongly right-skewed. Most players earn relatively low or moderate wages, while a small group of elite players earn very high wages. This is consistent with the superstar effect in sports labor markets, where top players receive very large salary premiums.
Because of this skewness, a logarithmic transformation was applied to wages.
After the log transformation, the distribution becomes more symmetric. This suggests that the original wage variable is approximately log-normal. The log wage variable is also more appropriate for linear regression because it reduces the influence of extreme salary values.
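The effect of the log transformation on skewness can be illustrated with a minimal sketch. It does not use the report's data: the wage vector below is simulated from a log-normal distribution, and the skewness helper and all parameter values are assumptions chosen only for illustration.

```r
# Synthetic stand-in for the Wage column: log-normal draws, NOT the real data.
set.seed(465)
wage <- rlnorm(1000, meanlog = 13, sdlog = 1.2)

# Sample skewness: the third standardized moment (near 0 for a symmetric distribution).
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Natural log of wages, the same transformation used for the regression outcome.
log_wage <- log(wage)

skew_raw <- skewness(wage)
skew_log <- skewness(log_wage)

cat("Skewness of wage:     ", round(skew_raw, 2), "\n")
cat("Skewness of log(wage):", round(skew_log, 2), "\n")
```

On simulated data of this kind the raw skewness is strongly positive while the log-scale skewness is close to zero, which is the pattern the histograms describe.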
This table compares wages across player positions as an initial descriptive check. Differences in average wages across positions suggest that a player's role may be related to compensation.
4 Modeling
The data were split into training and test sets using an 80/20 split. A seed was set for reproducibility.
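A seeded 80/20 split can be sketched in base R as follows. The report does not show which splitting function was used, so this is one possible approach rather than the authors' exact code. The data frame salary_demo is a synthetic placeholder, and only the seed value 465 comes from the report.

```r
set.seed(465)  # seed value taken from the report's reproducibility notes

# Synthetic placeholder data frame; the real analysis uses salary_clean.
n <- 500
salary_demo <- data.frame(
  Age  = sample(18:38, n, replace = TRUE),
  Apps = sample(0:400, n, replace = TRUE)
)

# Draw 80% of the row indices at random for training; the rest form the test set.
train_idx    <- sample(seq_len(n), size = floor(0.8 * n))
salary_train <- salary_demo[train_idx, ]
salary_test  <- salary_demo[-train_idx, ]

cat("Training rows:", nrow(salary_train), " Test rows:", nrow(salary_test), "\n")
```

Because the seed is set before sampling, re-running the block reproduces the same partition.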
Model 1 uses basic player characteristics: age, appearances, and national team caps.
Model 2 adds position and league controls. This model is more complete because it accounts for differences in football roles and labor market environments.
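# Requires broom (tidy), dplyr (mutate, bind_rows, select, %>%), and knitr (kable)

# Model 1: basic player characteristics
model_1 <- lm(log_wage ~ Age + Apps + Caps, data = salary_train)

# Model 2: adds position and league controls
model_2 <- lm(log_wage ~ Age + Apps + Caps + Position + League, data = salary_train)

model_1_results <- tidy(model_1) %>%
  mutate(Model = "Model 1: Basic")
model_2_results <- tidy(model_2) %>%
  mutate(Model = "Model 2: Expanded")

# Stack both coefficient tables and print them as one formatted table
bind_rows(model_1_results, model_2_results) %>%
  select(Model, term, estimate, std.error, statistic, p.value) %>%
  kable(digits = 4, caption = "Regression Coefficient Results")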
Regression Coefficient Results

| Model             | term                    | estimate | std.error | statistic | p.value |
|-------------------|-------------------------|----------|-----------|-----------|---------|
| Model 1: Basic    | (Intercept)             | 12.1177  | 0.2574    | 47.0837   | 0.0000  |
| Model 1: Basic    | Age                     | -0.0311  | 0.0132    | -2.3514   | 0.0188  |
| Model 1: Basic    | Apps                    | 0.0089   | 0.0005    | 17.0883   | 0.0000  |
| Model 1: Basic    | Caps                    | 0.0155   | 0.0014    | 10.8891   | 0.0000  |
| Model 2: Expanded | (Intercept)             | 11.0091  | 0.2599    | 42.3510   | 0.0000  |
| Model 2: Expanded | Age                     | 0.0464   | 0.0130    | 3.5723    | 0.0004  |
| Model 2: Expanded | Apps                    | 0.0059   | 0.0005    | 11.5970   | 0.0000  |
| Model 2: Expanded | Caps                    | 0.0126   | 0.0013    | 9.6627    | 0.0000  |
| Model 2: Expanded | PositionForward         | -0.0564  | 0.0615    | -0.9173   | 0.3591  |
| Model 2: Expanded | PositionGoalkeeper      | -0.4538  | 0.0770    | -5.8898   | 0.0000  |
| Model 2: Expanded | PositionMidfilder       | 0.1155   | 0.0558    | 2.0683    | 0.0387  |
| Model 2: Expanded | LeagueLa Liga           | -0.4779  | 0.0833    | -5.7400   | 0.0000  |
| Model 2: Expanded | LeagueLigue 1 Uber Eats | -0.1681  | 0.0883    | -1.9039   | 0.0570  |
| Model 2: Expanded | LeaguePremier League    | 0.4149   | 0.0767    | 5.4083    | 0.0000  |
| Model 2: Expanded | LeaguePrimiera Liga     | -1.3406  | 0.0802    | -16.7151  | 0.0000  |
| Model 2: Expanded | LeagueSerie A           | -0.1405  | 0.0799    | -1.7570   | 0.0790  |
5 Results
Predictions were made on the test set and evaluated using RMSE and R-squared.
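The preferred model is the one with the lower RMSE and the higher R-squared on the test set. RMSE (root mean squared error) measures the typical size of the prediction error in log wages and penalizes large errors more heavily than small ones, while R-squared measures the share of the variation in log wages that the model explains.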
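The two evaluation metrics can be written out as a short sketch. The function names rmse and r_squared and the example vectors are hypothetical. R-squared is computed here as one minus the ratio of residual to total sum of squares. Some packages instead report the squared correlation between actual and predicted values, so the report's figures may follow a different convention.

```r
# Root mean squared error: square root of the average squared prediction error.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R-squared: 1 - (residual sum of squares / total sum of squares).
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Tiny made-up vectors, used only to exercise the functions.
actual    <- c(10, 11, 12, 13)
predicted <- c(10.5, 10.5, 12.5, 12.5)

cat("RMSE:     ", rmse(actual, predicted), "\n")
cat("R-squared:", r_squared(actual, predicted), "\n")
```

In the report's setting, actual would be the test-set log_wage values and predicted the output of predict() for each model on salary_test.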
5.1 Cross-Validation
A 5-fold cross-validation was performed for the expanded model.
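If the cross-validation results are close to the test set results, this suggests that the model's performance is stable across different subsets of the data. A large gap in either direction would suggest that the estimates depend heavily on the particular split. Overfitting would show up differently: as training error well below the cross-validated or test error.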
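The mechanics of 5-fold cross-validation can be sketched in base R. The report does not show its cross-validation code, so this is an illustrative version, not the authors' implementation. The data frame demo is simulated, and the formula uses only two predictors for brevity.

```r
set.seed(465)

# Synthetic data with a known linear relationship (NOT the real dataset).
n <- 200
demo <- data.frame(Apps = runif(n, 0, 300), Caps = runif(n, 0, 100))
demo$log_wage <- 11 + 0.006 * demo$Apps + 0.012 * demo$Caps + rnorm(n, sd = 0.5)

# Randomly assign each row to one of k folds of equal size.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, then score on the held-out fold.
cv_rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(log_wage ~ Apps + Caps, data = demo[fold != i, ])
  pred <- predict(fit, newdata = demo[fold == i, ])
  sqrt(mean((demo$log_wage[fold == i] - pred)^2))
})

cat("RMSE per fold:", round(cv_rmse, 3), "\n")
cat("Mean CV RMSE: ", round(mean(cv_rmse), 3), "\n")
```

The spread of the per-fold RMSE values gives a rough sense of how much performance depends on which observations are held out.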
6 Economic Interpretation
The model results help answer the economic question: which player characteristics best predict professional football players’ wages?
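Age, appearances, and national team caps are important because they measure different types of human capital and reputation. Apps reflects club-level experience, while Caps reflects international recognition. Both have positive and statistically significant coefficients in both models. In football labor markets, players with more appearances and more national team experience are often more visible, more trusted, and more valuable to clubs. The Age coefficient changes sign, from -0.0311 in Model 1 to 0.0464 in Model 2, which suggests that the basic model's age estimate partly picks up differences across leagues and positions.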
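The inclusion of Position and League is also economically meaningful. Different positions may be rewarded differently depending on market demand, scarcity, and visibility. League controls matter because football leagues differ in revenue, global audience, broadcasting income, and wage structures. In Model 2, each position and league coefficient is measured relative to the omitted baseline category. The largest effects are the Premier League premium (0.4149), the Primiera Liga discount (-1.3406), and the goalkeeper discount (-0.4538).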
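Because the dependent variable is log wage, each coefficient can be read approximately as the proportional change in wages for a one-unit increase in the predictor, holding other variables constant. For example, the Model 2 estimate for Caps (0.0126) implies that one additional national team cap is associated with roughly 1.3% higher wages. For larger coefficients, such as the league indicators, this approximation is poor and the exact change is 100 × (exp(β) − 1) percent.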
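The conversion from log-wage coefficients to percentage changes can be sketched as follows. The estimates are copied from the Model 2 rows of the coefficient table. The variable names are hypothetical, and the league figure is a difference relative to the omitted baseline league.

```r
# Model 2 estimates copied from the regression coefficient table.
coefs <- c(Caps = 0.0126, Apps = 0.0059, `LeaguePremier League` = 0.4149)

# Exact percentage change in wage for a one-unit increase: 100 * (exp(b) - 1).
pct_exact <- 100 * (exp(coefs) - 1)

# Common shortcut, accurate only when |b| is small: 100 * b.
pct_approx <- 100 * coefs

print(round(cbind(approx = pct_approx, exact = pct_exact), 2))
```

For Caps and Apps the two columns nearly coincide, while for the Premier League indicator the exact figure is noticeably larger than the shortcut, which is why the shortcut is best reserved for small coefficients.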
These findings could inform business decisions by helping clubs identify which observable player characteristics are most closely related to wages. They could also support future research on wage inequality, superstar effects, and whether clubs overpay for reputation relative to actual performance.
7 Limitations and Reproducibility
7.1 Limitations
First, the dataset does not include all possible determinants of football wages. Important factors such as injuries, contract length, transfer fees, agent influence, sponsorship value, and recent performance statistics are not included.
Second, the analysis is observational. The regression results show associations, not necessarily causal effects. For example, players with more caps may earn higher wages, but this does not prove that caps directly cause higher wages.
Third, the dataset is cross-sectional. It captures players at one point in time and does not follow wage changes over multiple seasons.
7.2 Reproducibility
Several steps were taken to make the analysis reproducible:
The project uses a relative file path: SalaryPrediction.csv
set.seed(465) is used before data splitting and cross-validation
The analysis is written in a single Quarto document
All data cleaning, visualization, modeling, and evaluation steps are included in the document
The code can be re-run using quarto render as long as the CSV file is in the same folder
8 AI Usage
8.1 Prompt
We asked ChatGPT how to organize a Stage 3 final report using one regression dataset and how to combine probability analysis, modeling results, economic interpretation, limitations, and final reflection in a Quarto document.
8.2 How the Output Was Used
The response was used as a general outline for the report structure. The code and explanations were modified to match the actual dataset, especially the wage outcome variable and the available predictors.
8.3 Verification
The code was checked by rendering the Quarto document, confirming that the dataset loaded correctly, and reviewing the model outputs, tables, and plots.
9 Final Reflections
With more time and better data, we would improve the analysis by adding more detailed variables such as goals, assists, minutes played, injury history, contract length, and transfer value. These variables would likely improve prediction accuracy and allow a richer economic interpretation.
A new economic question inspired by this analysis is:
Do football players from certain nationalities or leagues earn wage premiums even after controlling for experience and performance?
This question would allow future research to investigate labor market segmentation, reputation effects, and possible wage inequality in professional football.
10 Conclusion
This final report examined the determinants of professional football player wages using a regression-based predictive modeling approach. The analysis showed that wages are highly right-skewed and that a log transformation makes the outcome more suitable for modeling.
The regression models suggest that player experience, international recognition, position, and league are useful predictors of wages. Economically, this supports the idea that football wages reflect both human capital and market visibility. Although the analysis has limitations, it provides a clear example of how data science methods can be used to study wage determination in a high-value sports labor market.