Stage 3 Final Report: Football Player Wage Prediction

Authors

Kutay Polat and Efe Colak

Published

June 5, 2026

1 Economic Question

The main economic question of this project is:

Which player characteristics best predict professional football players’ wages?

This question is economically important because professional football is a high-value labor market where wages reflect performance, experience, reputation, and market demand. Understanding which characteristics are most strongly associated with wages can help clubs evaluate players, negotiate contracts, and study how labor markets reward observable productivity.

2 Data

The dataset used in this report is SalaryPrediction.csv. It contains information on professional football players and their wages. The dependent variable is Wage, a continuous numeric outcome variable representing player salary. The main explanatory variables used in the analysis are:

Age: player age
Apps: number of appearances
Caps: number of national team caps
Position: playing position
League: football league

library(tidyverse)
library(rsample)
library(yardstick)
library(broom)
library(knitr)

set.seed(465)

salary_raw <- read_csv("SalaryPrediction.csv", show_col_types = FALSE)

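# Clean the raw data: parse wages to numbers, encode categories, and add log wage.
salary_clean <- salary_raw %>%
  mutate(
    Wage = parse_number(as.character(Wage)),  # strip any non-numeric characters
    Position = as.factor(Position),
    League = as.factor(League),
    log_wage = log(Wage)                      # natural log, used as the regression outcome
  ) %>%
  # Keep only complete rows for the modeling variables, then drop exact duplicates.
  drop_na(Wage, Age, Apps, Caps, Position, League, log_wage) %>%
  distinct()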

salary_clean %>%
  summarise(
    observations = n(),
    variables = ncol(salary_clean),
    min_wage = min(Wage),
    median_wage = median(Wage),
    mean_wage = mean(Wage),
    max_wage = max(Wage)
  ) %>%
  kable(digits = 2, caption = "Dataset Overview")

Dataset Overview
observations	variables	min_wage	median_wage	mean_wage	max_wage
3842	9	1400	416000	1390327	46427000

The dataset includes professional players with very different salary levels. This makes it useful for studying wage inequality and the economic rewards associated with experience and performance-related indicators.

3 Probability and Distribution Analysis

The dependent variable Wage is continuous. In professional sports, wages are usually highly unequal: most players earn moderate wages, while a small number of elite players earn extremely high wages.

salary_clean %>%
  summarise(
    Mean = mean(Wage),
    Median = median(Wage),
    SD = sd(Wage),
    Q1 = quantile(Wage, 0.25),
    Q3 = quantile(Wage, 0.75),
    Min = min(Wage),
    Max = max(Wage)
  ) %>%
  kable(digits = 2, caption = "Summary Statistics for Wage")

Summary Statistics for Wage
Mean	Median	SD	Q1	Q3	Min	Max
1390327	416000	2605882	78000	1569500	1400	46427000

ggplot(salary_clean, aes(x = Wage)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Football Player Wages",
    x = "Wage",
    y = "Count"
  ) +
  theme_minimal()

The wage distribution is strongly right-skewed: the mean wage (1,390,327) is more than three times the median (416,000). Most players earn relatively low or moderate wages, while a small group of elite players earn very high wages. This pattern is consistent with the superstar effect in sports labor markets, where top players receive very large salary premiums.

Because of this skewness, a logarithmic transformation was applied to wages.

ggplot(salary_clean, aes(x = log_wage)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Log Wages",
    x = "Log Wage",
    y = "Count"
  ) +
  theme_minimal()

After the log transformation, the distribution becomes more symmetric, which is consistent with wages being approximately log-normally distributed. Log wage is also better suited to linear regression because it reduces the influence of extreme salary values.
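
A small simulation can illustrate why the log transformation helps. The sketch below does not use SalaryPrediction.csv. It draws hypothetical wages from a log-normal distribution (meanlog = 13 and sdlog = 1.5 are assumed values chosen only for illustration) and compares the sample skewness before and after taking logs. The helper skewness() is defined here for the example and is not part of the packages loaded in this report.

```r
# Illustrative simulation; these are NOT the wages from SalaryPrediction.csv.
set.seed(465)

# Hypothetical log-normal wages (meanlog and sdlog are assumed values).
sim_wage <- rlnorm(5000, meanlog = 13, sdlog = 1.5)

# Moment-based sample skewness: near 0 for a symmetric distribution,
# large and positive for a long right tail.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(sim_wage)       # skewness of the raw simulated wages
skew_log <- skewness(log(sim_wage))  # skewness after the log transformation

c(raw = skew_raw, log = skew_log)

# For right-skewed data the mean lies above the median.
mean(sim_wage) / median(sim_wage)
```

If the real wage data behave similarly, the raw skewness should be large and positive while the skewness of log_wage should be much closer to zero.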

salary_clean %>%
  group_by(Position) %>%
  summarise(
    observations = n(),
    average_wage = mean(Wage),
    median_wage = median(Wage),
    average_log_wage = mean(log_wage),
    .groups = "drop"
  ) %>%
  arrange(desc(average_wage)) %>%
  kable(digits = 2, caption = "Average Wage by Position")

Average Wage by Position
Position	observations	average_wage	median_wage	average_log_wage
Midfilder	1140	1630598	559500	13.01
Forward	821	1352223	287000	12.56
Defender	1452	1332659	463000	12.85
Goalkeeper	429	1019950	231000	12.44

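This table provides an initial descriptive comparison across player positions. Midfielders (labeled Midfilder in the data) have the highest mean and median wage, and goalkeepers have the lowest. Forwards have the second-highest mean wage but a lower median than defenders, which indicates that a small number of very highly paid forwards pull up the forward average. These differences suggest that a player's role may be related to compensation.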

4 Modeling

The data were split into training and test sets using an 80/20 split. A seed was set for reproducibility.

salary_split <- initial_split(salary_clean, prop = 0.80)
salary_train <- training(salary_split)
salary_test <- testing(salary_split)

split_sizes <- tibble(
  Dataset = c("Training Set", "Test Set"),
  Sample_Size = c(nrow(salary_train), nrow(salary_test))
)

split_sizes %>%
  kable(caption = "Train/Test Split Sample Sizes")

Train/Test Split Sample Sizes
Dataset	Sample_Size
Training Set	3073
Test Set	769
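
The sample sizes above follow from applying the 80% proportion to the 3,842 cleaned observations. The following base R sketch is an analogue of the split, not rsample's internal code. It assumes the training size is the proportion times the row count rounded down, which is consistent with the reported sizes. The names n_total, train_idx, and test_idx are introduced here for illustration.

```r
# Base R analogue of an 80/20 split (illustrative; not rsample's internals).
set.seed(465)

n_total <- 3842                    # observations reported in the Dataset Overview table
n_train <- floor(0.80 * n_total)   # training rows, rounded down (assumption)

train_idx <- sample(seq_len(n_total), size = n_train)  # random training row indices
test_idx  <- setdiff(seq_len(n_total), train_idx)      # remaining rows form the test set

c(training = length(train_idx), test = length(test_idx))
```

Because the row indices are drawn with a different procedure, this sketch would not reproduce the exact rows chosen by initial_split(), only the sizes of the two sets.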

Two linear regression models were estimated.

Model 1 uses basic player characteristics: age, appearances, and national team caps.

Model 2 adds position and league controls. This model is more complete because it accounts for differences in football roles and labor market environments.

model_1 <- lm(log_wage ~ Age + Apps + Caps, data = salary_train)
model_2 <- lm(log_wage ~ Age + Apps + Caps + Position + League, data = salary_train)

model_1_results <- tidy(model_1) %>%
  mutate(Model = "Model 1: Basic")

model_2_results <- tidy(model_2) %>%
  mutate(Model = "Model 2: Expanded")

bind_rows(model_1_results, model_2_results) %>%
  select(Model, term, estimate, std.error, statistic, p.value) %>%
  kable(digits = 4, caption = "Regression Coefficient Results")

Regression Coefficient Results
Model	term	estimate	std.error	statistic	p.value
Model 1: Basic	(Intercept)	12.1177	0.2574	47.0837	0.0000
Model 1: Basic	Age	-0.0311	0.0132	-2.3514	0.0188
Model 1: Basic	Apps	0.0089	0.0005	17.0883	0.0000
Model 1: Basic	Caps	0.0155	0.0014	10.8891	0.0000
Model 2: Expanded	(Intercept)	11.0091	0.2599	42.3510	0.0000
Model 2: Expanded	Age	0.0464	0.0130	3.5723	0.0004
Model 2: Expanded	Apps	0.0059	0.0005	11.5970	0.0000
Model 2: Expanded	Caps	0.0126	0.0013	9.6627	0.0000
Model 2: Expanded	PositionForward	-0.0564	0.0615	-0.9173	0.3591
Model 2: Expanded	PositionGoalkeeper	-0.4538	0.0770	-5.8898	0.0000
Model 2: Expanded	PositionMidfilder	0.1155	0.0558	2.0683	0.0387
Model 2: Expanded	LeagueLa Liga	-0.4779	0.0833	-5.7400	0.0000
Model 2: Expanded	LeagueLigue 1 Uber Eats	-0.1681	0.0883	-1.9039	0.0570
Model 2: Expanded	LeaguePremier League	0.4149	0.0767	5.4083	0.0000
Model 2: Expanded	LeaguePrimiera Liga	-1.3406	0.0802	-16.7151	0.0000
Model 2: Expanded	LeagueSerie A	-0.1405	0.0799	-1.7570	0.0790

5 Results

Predictions were made on the test set and evaluated using RMSE and R-squared.

pred_1 <- salary_test %>%
  mutate(pred_log_wage = predict(model_1, newdata = salary_test))

pred_2 <- salary_test %>%
  mutate(pred_log_wage = predict(model_2, newdata = salary_test))

metrics_1 <- tibble(
  Model = "Model 1: Basic",
  RMSE = rmse_vec(truth = pred_1$log_wage, estimate = pred_1$pred_log_wage),
  R_squared = rsq_vec(truth = pred_1$log_wage, estimate = pred_1$pred_log_wage)
)

metrics_2 <- tibble(
  Model = "Model 2: Expanded",
  RMSE = rmse_vec(truth = pred_2$log_wage, estimate = pred_2$pred_log_wage),
  R_squared = rsq_vec(truth = pred_2$log_wage, estimate = pred_2$pred_log_wage)
)

model_comparison <- bind_rows(metrics_1, metrics_2)

model_comparison %>%
  kable(digits = 4, caption = "Test Set Model Comparison")

Test Set Model Comparison
Model	RMSE	R_squared
Model 1: Basic	1.4195	0.4051
Model 2: Expanded	1.2547	0.5350

best_model_name <- model_comparison %>%
  arrange(RMSE, desc(R_squared)) %>%
  slice(1) %>%
  pull(Model)

best_model_name

[1] "Model 2: Expanded"

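The preferred model is the one with the lower RMSE and the higher R-squared on the test set, which here is Model 2. RMSE is the square root of the mean squared prediction error, measured in log-wage units, so it penalizes large errors more heavily than a simple average error would. The R-squared returned by rsq_vec is the squared correlation between actual and predicted log wages, which is read as the share of variation in log wages that the predictions account for.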
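
To make the two metrics concrete, the sketch below computes them by hand in base R on a few toy values that are not taken from the report's data. It also computes the traditional R-squared (1 - SSE/SST) for comparison, since yardstick's rsq_vec is based on the squared correlation and the two definitions can differ. The object names are chosen for this example.

```r
# Toy values for illustration only (not taken from the report's data).
truth    <- c(1, 2, 3, 4)
estimate <- c(1.5, 2, 2.5, 4)

# RMSE: square root of the mean squared prediction error.
rmse_manual <- sqrt(mean((truth - estimate)^2))

# Squared correlation between actual and predicted values.
rsq_corr <- cor(truth, estimate)^2

# Traditional R-squared: 1 - SSE / SST. It can differ from the squared
# correlation when predictions are biased or poorly scaled.
rsq_trad <- 1 - sum((truth - estimate)^2) / sum((truth - mean(truth))^2)

c(RMSE = rmse_manual, rsq_corr = rsq_corr, rsq_trad = rsq_trad)
```

On these toy values the two R-squared definitions give slightly different numbers, which is why it matters to state which one a report uses.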

5.1 Cross-Validation

Five-fold cross-validation was performed on the training set for the expanded model.

set.seed(465)
folds <- vfold_cv(salary_train, v = 5)

# For each fold: fit Model 2 on the analysis rows, then score it on the held-out rows.
cv_results <- map_dfr(folds$splits, function(split) {
  analysis_data <- analysis(split)      # rows used to fit the model in this fold
  assessment_data <- assessment(split)  # held-out rows used to evaluate it
  cv_model <- lm(log_wage ~ Age + Apps + Caps + Position + League, data = analysis_data)
  cv_pred <- assessment_data %>%
    mutate(pred_log_wage = predict(cv_model, newdata = assessment_data))
  tibble(
    RMSE = rmse_vec(truth = cv_pred$log_wage, estimate = cv_pred$pred_log_wage),
    R_squared = rsq_vec(truth = cv_pred$log_wage, estimate = cv_pred$pred_log_wage)
  )
}) %>%
  mutate(Fold = paste0("Fold ", row_number())) %>%  # label folds in order
  select(Fold, RMSE, R_squared)

cv_results %>%
  kable(digits = 4, caption = "5-Fold Cross-Validation Results")

5-Fold Cross-Validation Results
Fold	RMSE	R_squared
Fold 1	1.2753	0.5619
Fold 2	1.2169	0.5690
Fold 3	1.2822	0.5288
Fold 4	1.2001	0.5450
Fold 5	1.2638	0.5198

cv_summary <- cv_results %>%
  summarise(
    Average_RMSE = mean(RMSE),
    Average_R_squared = mean(R_squared)
  )

cv_summary %>%
  kable(digits = 4, caption = "Average Cross-Validated Performance")

Average Cross-Validated Performance
Average_RMSE	Average_R_squared
1.2476	0.5449

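The average cross-validated RMSE (1.2476) and R-squared (0.5449) are close to the test set values for Model 2 (1.2547 and 0.5350), which suggests that the model's performance is relatively stable across samples. Overfitting would instead show up as performance on held-out data that is much worse than performance on the data used to fit the model.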
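
The comparison between cross-validated and test-set error can be made concrete with the values reported in the tables above. The sketch below copies those rounded values by hand, so its average can differ from the reported one in the last digit. The two-standard-deviation threshold is a rule of thumb assumed for this illustration, not a formal test.

```r
# Fold-level RMSE values copied from the 5-Fold Cross-Validation Results table.
cv_rmse <- c(1.2753, 1.2169, 1.2822, 1.2001, 1.2638)

# Test-set RMSE for Model 2 from the Test Set Model Comparison table.
test_rmse <- 1.2547

cv_mean <- mean(cv_rmse)  # average cross-validated RMSE
cv_sd   <- sd(cv_rmse)    # spread of RMSE across folds

# Gap between test and cross-validated error, in log-wage units.
gap <- test_rmse - cv_mean

# Rule of thumb (an assumption of this sketch): the two estimates agree
# if the gap is within two fold-level standard deviations.
within_two_sd <- abs(gap) <= 2 * cv_sd

c(cv_mean = cv_mean, cv_sd = cv_sd, gap = gap)
within_two_sd
```

The gap is small relative to the fold-to-fold variation, which supports reading the test-set result as typical rather than as a lucky or unlucky split.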

6 Economic Interpretation

The model results help answer the economic question: which player characteristics best predict professional football players’ wages?

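Age, appearances, and national team caps are important because they measure different types of human capital and reputation. Apps reflects club-level experience, while Caps reflects international recognition. In football labor markets, players with more appearances and more national team experience are often more visible, more trusted, and more valuable to clubs. Both Apps and Caps have positive, statistically significant coefficients in both models. The Age coefficient, by contrast, changes sign from -0.0311 in Model 1 to 0.0464 in Model 2, so the estimated association between age and wages depends on whether position and league are held constant.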

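The inclusion of Position and League is also economically meaningful. Different positions may be rewarded differently depending on market demand, scarcity, and visibility. League controls matter because football leagues differ in revenue, global audience, broadcasting income, and wage structures. In Model 2, goalkeepers earn significantly less than defenders (the omitted position category). Relative to the omitted league category, the Premier League coefficient (0.4149) is the largest positive league effect and the Primiera Liga coefficient (-1.3406) is the largest negative one.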

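Because the dependent variable is log wage, coefficients can be interpreted approximately as percentage changes in wages. For example, the Caps coefficient of 0.0126 in Model 2 implies that one additional national team cap is associated with roughly 1.3% higher expected wages, holding other variables constant. For larger coefficients, such as the league indicators, this approximation is less accurate and the exact percentage change is 100 * (exp(b) - 1), where b is the coefficient.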
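
The conversion from log-wage coefficients to percentage changes can be sketched as follows. The estimates are copied by hand from the Regression Coefficient Results table for Model 2. The vector names approx_pct and exact_pct are introduced here for illustration.

```r
# Selected Model 2 estimates copied from the Regression Coefficient Results table.
coefs <- c(
  "Caps"                 = 0.0126,
  "PositionGoalkeeper"   = -0.4538,
  "LeaguePremier League" = 0.4149
)

# Quick approximation: 100 * coefficient (reasonable only for small coefficients).
approx_pct <- 100 * coefs

# Exact percentage change in wages implied by a log-linear model.
exact_pct <- 100 * (exp(coefs) - 1)

round(cbind(approx_pct, exact_pct), 2)
```

For Caps the two columns nearly coincide, while for the goalkeeper and Premier League indicators the exact change differs noticeably from 100 times the coefficient. These remain associations conditional on the other regressors, not causal effects.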

These findings could inform business decisions by helping clubs identify which observable player characteristics are most closely related to wages. They could also support future research on wage inequality, superstar effects, and whether clubs overpay for reputation relative to actual performance.

7 Limitations and Reproducibility

7.1 Limitations

First, the dataset does not include all possible determinants of football wages. Important factors such as injuries, contract length, transfer fees, agent influence, sponsorship value, and recent performance statistics are not included.

Second, the analysis is observational. The regression results show associations, not necessarily causal effects. For example, players with more caps may earn higher wages, but this does not prove that caps directly cause higher wages.

Third, the dataset is cross-sectional. It captures players at one point in time and does not follow wage changes over multiple seasons.

7.2 Reproducibility

Several steps were taken to make the analysis reproducible:

The project uses a relative file path: SalaryPrediction.csv
set.seed(465) is used before data splitting and cross-validation
The analysis is written in a single Quarto document
All data cleaning, visualization, modeling, and evaluation steps are included in the document
The code can be re-run using quarto render as long as the CSV file is in the same folder

sessionInfo()

R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=Turkish_Türkiye.utf8  LC_CTYPE=Turkish_Türkiye.utf8   
[3] LC_MONETARY=Turkish_Türkiye.utf8 LC_NUMERIC=C                    
[5] LC_TIME=Turkish_Türkiye.utf8    

time zone: Europe/Istanbul
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] knitr_1.50      broom_1.0.10    yardstick_1.3.2 rsample_1.3.2  
 [5] lubridate_1.9.4 forcats_1.0.1   stringr_1.6.0   dplyr_1.1.4    
 [9] purrr_1.2.1     readr_2.1.6     tidyr_1.3.1     tibble_3.3.0   
[13] ggplot2_4.0.1   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] future_1.70.0      generics_0.1.4     stringi_1.8.7      listenv_0.10.1    
 [5] hms_1.1.4          digest_0.6.38      magrittr_2.0.4     evaluate_1.0.5    
 [9] grid_4.5.2         timechange_0.3.0   RColorBrewer_1.1-3 fastmap_1.2.0     
[13] jsonlite_2.0.0     backports_1.5.0    scales_1.4.0       codetools_0.2-20  
[17] cli_3.6.5          crayon_1.5.3       rlang_1.1.7        parallelly_1.46.1 
[21] bit64_4.6.0-1      withr_3.0.2        yaml_2.3.10        tools_4.5.2       
[25] parallel_4.5.2     tzdb_0.5.0         globals_0.19.1     vctrs_0.7.2       
[29] R6_2.6.1           lifecycle_1.0.4    bit_4.6.0          vroom_1.6.6       
[33] furrr_0.4.0        pkgconfig_2.0.3    pillar_1.11.1      gtable_0.3.6      
[37] glue_1.8.0         xfun_0.54          tidyselect_1.2.1   rstudioapi_0.17.1 
[41] farver_2.1.2       htmltools_0.5.8.1  labeling_0.4.3     rmarkdown_2.31    
[45] compiler_4.5.2     S7_0.2.1

8 AI Use Log

8.1 Prompt

We asked ChatGPT how to organize a Stage 3 final report using one regression dataset and how to combine probability analysis, modeling results, economic interpretation, limitations, and final reflection in a Quarto document.

8.2 How the Output Was Used

The response was used as a general outline for the report structure. The code and explanations were modified to match the actual dataset, especially the wage outcome variable and the available predictors.

8.3 Verification

The code was checked by rendering the Quarto document, confirming that the dataset loaded correctly, and reviewing the model outputs, tables, and plots.

9 Final Reflections

With more time and better data, we would improve the analysis by adding more detailed variables such as goals, assists, minutes played, injury history, contract length, and transfer value. These variables would likely improve prediction accuracy and create a richer economic interpretation.

A new economic question inspired by this analysis is:

Do football players from certain nationalities or leagues earn wage premiums even after controlling for experience and performance?

This question would allow future research to investigate labor market segmentation, reputation effects, and possible wage inequality in professional football.

10 Conclusion

This final report examined the determinants of professional football player wages using a regression-based predictive modeling approach. The analysis showed that wages are highly right-skewed and that a log transformation makes the outcome more suitable for modeling.

The regression models suggest that player experience, international recognition, position, and league are useful predictors of wages: the expanded model reaches a test-set R-squared of 0.5350, compared with 0.4051 for the basic model. Economically, this supports the idea that football wages reflect both human capital and market visibility. Although the analysis has limitations, it provides a clear example of how data science methods can be used to study wage determination in a high-value sports labor market.