Stage 3 Final Report: Football Player Wage Prediction

Author

Kutay Polat and Efe Colak

Published

June 5, 2026

1 Economic Question

The main economic question of this project is:

Which player characteristics best predict professional football players’ wages?

This question is economically important because professional football is a high-value labor market where wages reflect performance, experience, reputation, and market demand. Understanding which characteristics are most strongly associated with wages can help clubs evaluate players, negotiate contracts, and study how labor markets reward observable productivity.

2 Data

The dataset used in this report is SalaryPrediction.csv. It contains information on professional football players and their wages. The dependent variable is Wage, a continuous numeric outcome variable representing player salary. The main explanatory variables used in the analysis are:

  • Age: player age
  • Apps: number of appearances
  • Caps: number of national team caps
  • Position: playing position
  • League: football league
library(tidyverse)
library(rsample)
library(yardstick)
library(broom)
library(knitr)

set.seed(465)

salary_raw <- read_csv("SalaryPrediction.csv", show_col_types = FALSE)

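salary_clean <- salary_raw %>%
  mutate(
    # Convert Wage to numeric, dropping any non-numeric characters around the number
    Wage = parse_number(as.character(Wage)),
    Position = as.factor(Position),
    League = as.factor(League),
    # Log wage is the outcome used in the regression models
    log_wage = log(Wage)
  ) %>%
  # Keep only rows with complete modeling variables, then remove exact duplicate rows
  drop_na(Wage, Age, Apps, Caps, Position, League, log_wage) %>%
  distinct()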

salary_clean %>%
  summarise(
    observations = n(),
    variables = ncol(salary_clean),
    min_wage = min(Wage),
    median_wage = median(Wage),
    mean_wage = mean(Wage),
    max_wage = max(Wage)
  ) %>%
  kable(digits = 2, caption = "Dataset Overview")

Dataset Overview

observations  variables  min_wage  median_wage  mean_wage  max_wage
        3842          9      1400       416000    1390327  46427000

The dataset includes professional players with very different salary levels. This makes it useful for studying wage inequality and the economic rewards associated with experience and performance-related indicators.

3 Probability and Distribution Analysis

The dependent variable Wage is continuous. In professional sports, wages are usually highly unequal: most players earn moderate wages, while a small number of elite players earn extremely high wages.

salary_clean %>%
  summarise(
    Mean = mean(Wage),
    Median = median(Wage),
    SD = sd(Wage),
    Q1 = quantile(Wage, 0.25),
    Q3 = quantile(Wage, 0.75),
    Min = min(Wage),
    Max = max(Wage)
  ) %>%
  kable(digits = 2, caption = "Summary Statistics for Wage")

Summary Statistics for Wage

   Mean  Median       SD     Q1       Q3   Min       Max
1390327  416000  2605882  78000  1569500  1400  46427000

ggplot(salary_clean, aes(x = Wage)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Football Player Wages",
    x = "Wage",
    y = "Count"
  ) +
  theme_minimal()

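The wage distribution is strongly right-skewed: the mean wage (about 1.39 million) is more than three times the median (416,000), and the maximum (about 46.4 million) is more than 100 times the median. Most players earn relatively low or moderate wages, while a small group of elite players earn very high wages. This is consistent with the superstar effect in sports labor markets, where top players receive very large salary premiums.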

Because of this skewness, a logarithmic transformation was applied to wages.

ggplot(salary_clean, aes(x = log_wage)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Log Wages",
    x = "Log Wage",
    y = "Count"
  ) +
  theme_minimal()

After the log transformation, the distribution becomes more symmetric. This is consistent with the original wage variable being approximately log-normal, although a histogram alone is not a formal test of that claim. The log wage variable is also more appropriate for linear regression because it reduces the influence of extreme salary values.
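
One way to check the symmetry claim numerically is to compare sample skewness before and after the transformation. The sketch below does this on simulated log-normal data rather than on SalaryPrediction.csv. The helper sample_skewness() and the simulation parameters are illustrative assumptions and are not part of the original analysis.

```r
# Illustrative only: simulated wages, not the SalaryPrediction.csv data.
set.seed(465)

# Hypothetical helper: moment-based sample skewness (near 0 for a symmetric sample).
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Draw log-normal "wages"; meanlog and sdlog are arbitrary assumed values.
sim_wage <- rlnorm(5000, meanlog = 13, sdlog = 1.5)

skew_raw <- sample_skewness(sim_wage)       # strongly positive for right-skewed data
skew_log <- sample_skewness(log(sim_wage))  # close to zero after the log transform

c(raw = skew_raw, log = skew_log)
```

On the real data, the same helper could be applied to salary_clean$Wage and salary_clean$log_wage to see how far the skewness falls after the transformation.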

salary_clean %>%
  group_by(Position) %>%
  summarise(
    observations = n(),
    average_wage = mean(Wage),
    median_wage = median(Wage),
    average_log_wage = mean(log_wage),
    .groups = "drop"
  ) %>%
  arrange(desc(average_wage)) %>%
  kable(digits = 2, caption = "Average Wage by Position")

Average Wage by Position

Position    observations  average_wage  median_wage  average_log_wage
Midfilder           1140       1630598       559500             13.01
Forward              821       1352223       287000             12.56
Defender            1452       1332659       463000             12.85
Goalkeeper           429       1019950       231000             12.44

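This table provides an initial, unconditional comparison across player positions. Midfielders (spelled Midfilder in the dataset) have the highest mean and median wages, and goalkeepers have the lowest. Forwards have a slightly higher mean wage than defenders but a much lower median (287,000 versus 463,000), which suggests that a small number of very highly paid forwards pull up the forward average. These differences suggest that a player's role may be related to compensation, although the comparison does not yet control for experience or league.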

4 Modeling

The data were split into training and test sets using an 80/20 split. A seed was set for reproducibility.

salary_split <- initial_split(salary_clean, prop = 0.80)
salary_train <- training(salary_split)
salary_test <- testing(salary_split)

split_sizes <- tibble(
  Dataset = c("Training Set", "Test Set"),
  Sample_Size = c(nrow(salary_train), nrow(salary_test))
)

split_sizes %>%
  kable(caption = "Train/Test Split Sample Sizes")

Train/Test Split Sample Sizes

Dataset       Sample_Size
Training Set         3073
Test Set              769

Two linear regression models were estimated.

Model 1 uses basic player characteristics: age, appearances, and national team caps.

Model 2 adds position and league controls. This model is more complete because it accounts for differences in football roles and labor market environments.
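
Written out, the two specifications can be sketched as follows. The notation is added here for exposition: the coefficient symbols and the player index i are assumptions, and the sums run over the non-reference levels of Position and League, for which lm() creates indicator variables.

```latex
\begin{align*}
\text{Model 1:}\quad \log(\text{Wage}_i) &= \beta_0 + \beta_1\,\text{Age}_i + \beta_2\,\text{Apps}_i + \beta_3\,\text{Caps}_i + \varepsilon_i \\
\text{Model 2:}\quad \log(\text{Wage}_i) &= \beta_0 + \beta_1\,\text{Age}_i + \beta_2\,\text{Apps}_i + \beta_3\,\text{Caps}_i \\
&\quad + \sum_{p} \gamma_p\,\mathbf{1}[\text{Position}_i = p] + \sum_{l} \delta_l\,\mathbf{1}[\text{League}_i = l] + \varepsilon_i
\end{align*}
```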

model_1 <- lm(log_wage ~ Age + Apps + Caps, data = salary_train)
model_2 <- lm(log_wage ~ Age + Apps + Caps + Position + League, data = salary_train)

model_1_results <- tidy(model_1) %>%
  mutate(Model = "Model 1: Basic")

model_2_results <- tidy(model_2) %>%
  mutate(Model = "Model 2: Expanded")

bind_rows(model_1_results, model_2_results) %>%
  select(Model, term, estimate, std.error, statistic, p.value) %>%
  kable(digits = 4, caption = "Regression Coefficient Results")

Regression Coefficient Results

Model              term                     estimate  std.error  statistic  p.value
Model 1: Basic     (Intercept)               12.1177     0.2574    47.0837   0.0000
Model 1: Basic     Age                       -0.0311     0.0132    -2.3514   0.0188
Model 1: Basic     Apps                       0.0089     0.0005    17.0883   0.0000
Model 1: Basic     Caps                       0.0155     0.0014    10.8891   0.0000
Model 2: Expanded  (Intercept)               11.0091     0.2599    42.3510   0.0000
Model 2: Expanded  Age                        0.0464     0.0130     3.5723   0.0004
Model 2: Expanded  Apps                       0.0059     0.0005    11.5970   0.0000
Model 2: Expanded  Caps                       0.0126     0.0013     9.6627   0.0000
Model 2: Expanded  PositionForward           -0.0564     0.0615    -0.9173   0.3591
Model 2: Expanded  PositionGoalkeeper        -0.4538     0.0770    -5.8898   0.0000
Model 2: Expanded  PositionMidfilder          0.1155     0.0558     2.0683   0.0387
Model 2: Expanded  LeagueLa Liga             -0.4779     0.0833    -5.7400   0.0000
Model 2: Expanded  LeagueLigue 1 Uber Eats   -0.1681     0.0883    -1.9039   0.0570
Model 2: Expanded  LeaguePremier League       0.4149     0.0767     5.4083   0.0000
Model 2: Expanded  LeaguePrimiera Liga       -1.3406     0.0802   -16.7151   0.0000
Model 2: Expanded  LeagueSerie A             -0.1405     0.0799    -1.7570   0.0790

5 Results

Predictions were made on the test set and evaluated using RMSE and R-squared.

pred_1 <- salary_test %>%
  mutate(pred_log_wage = predict(model_1, newdata = salary_test))

pred_2 <- salary_test %>%
  mutate(pred_log_wage = predict(model_2, newdata = salary_test))

metrics_1 <- tibble(
  Model = "Model 1: Basic",
  RMSE = rmse_vec(truth = pred_1$log_wage, estimate = pred_1$pred_log_wage),
  R_squared = rsq_vec(truth = pred_1$log_wage, estimate = pred_1$pred_log_wage)
)

metrics_2 <- tibble(
  Model = "Model 2: Expanded",
  RMSE = rmse_vec(truth = pred_2$log_wage, estimate = pred_2$pred_log_wage),
  R_squared = rsq_vec(truth = pred_2$log_wage, estimate = pred_2$pred_log_wage)
)

model_comparison <- bind_rows(metrics_1, metrics_2)

model_comparison %>%
  kable(digits = 4, caption = "Test Set Model Comparison")

Test Set Model Comparison

Model                RMSE  R_squared
Model 1: Basic     1.4195     0.4051
Model 2: Expanded  1.2547     0.5350

best_model_name <- model_comparison %>%
  arrange(RMSE, desc(R_squared)) %>%
  slice(1) %>%
  pull(Model)

best_model_name
[1] "Model 2: Expanded"

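The expanded model is preferred because it has the lower test-set RMSE (1.2547 versus 1.4195) and the higher R-squared (0.5350 versus 0.4051). RMSE is the root mean squared prediction error in log wages, so it measures the typical size of a prediction error on the log scale; an RMSE of about 1.25 log points means a typical prediction is off by a factor of roughly exp(1.25), or about 3.5, in wage terms. R-squared, as computed by rsq_vec(), is the squared correlation between observed and predicted log wages, and it is commonly read as the share of variation in log wages accounted for by the model.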
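
The two metrics can be reproduced by hand, which also shows how the correlation-based R-squared differs from the traditional 1 - SSE/SST definition. The sketch below uses base R and a small set of made-up vectors; the values of truth and estimate are hypothetical and are not taken from the dataset.

```r
# Toy vectors (hypothetical values, not taken from the dataset).
truth    <- c(11.2, 12.5, 13.1, 14.0, 12.0, 15.3)
estimate <- c(11.8, 12.1, 13.5, 13.2, 12.6, 14.4)

# RMSE: square root of the mean squared prediction error.
rmse_manual <- sqrt(mean((truth - estimate)^2))

# R-squared as the squared correlation between truth and estimate.
rsq_cor <- cor(truth, estimate)^2

# Traditional R-squared: 1 - SSE / SST.
rsq_trad <- 1 - sum((truth - estimate)^2) / sum((truth - mean(truth))^2)

c(RMSE = rmse_manual, rsq_cor = rsq_cor, rsq_trad = rsq_trad)
```

The correlation-based value is never smaller than the traditional one, because it ignores any systematic bias or scaling error in the predictions. Reporting both would show whether the two definitions diverge for these models.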

5.1 Cross-Validation

A 5-fold cross-validation was performed for the expanded model, using the training set only.

set.seed(465)
folds <- vfold_cv(salary_train, v = 5)

cv_results <- map_dfr(folds$splits, function(split) {
  analysis_data <- analysis(split)
  assessment_data <- assessment(split)
  cv_model <- lm(log_wage ~ Age + Apps + Caps + Position + League, data = analysis_data)
  cv_pred <- assessment_data %>%
    mutate(pred_log_wage = predict(cv_model, newdata = assessment_data))
  tibble(
    RMSE = rmse_vec(truth = cv_pred$log_wage, estimate = cv_pred$pred_log_wage),
    R_squared = rsq_vec(truth = cv_pred$log_wage, estimate = cv_pred$pred_log_wage)
  )
}) %>%
  mutate(Fold = paste0("Fold ", row_number())) %>%
  select(Fold, RMSE, R_squared)

cv_results %>%
  kable(digits = 4, caption = "5-Fold Cross-Validation Results")

5-Fold Cross-Validation Results

Fold      RMSE  R_squared
Fold 1  1.2753     0.5619
Fold 2  1.2169     0.5690
Fold 3  1.2822     0.5288
Fold 4  1.2001     0.5450
Fold 5  1.2638     0.5198

cv_summary <- cv_results %>%
  summarise(
    Average_RMSE = mean(RMSE),
    Average_R_squared = mean(R_squared)
  )

cv_summary %>%
  kable(digits = 4, caption = "Average Cross-Validated Performance")

Average Cross-Validated Performance

Average_RMSE  Average_R_squared
      1.2476             0.5449

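The average cross-validated RMSE (1.2476) and R-squared (0.5449) are close to the test-set values (1.2547 and 0.5350), which suggests that the expanded model's performance is relatively stable across samples. Both figures are out-of-sample estimates, so close agreement is what one would expect. Overfitting would instead show up as in-sample (training) error that is much lower than the cross-validated or test error, while a large gap between the cross-validated and test results, in either direction, would suggest that the single 80/20 split is unrepresentative.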
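
A rough way to quantify "close" is to express the gap between the test-set metric and the cross-validated mean in units of the fold-to-fold standard deviation. The sketch below uses the rounded values reported in the tables in this section; the stability data frame and the gap_in_sd column are illustrative names, and with only five folds the standard deviation is itself a noisy estimate.

```r
# Rounded values copied from the cross-validation and test-set tables in this report.
cv_rmse   <- c(1.2753, 1.2169, 1.2822, 1.2001, 1.2638)
cv_rsq    <- c(0.5619, 0.5690, 0.5288, 0.5450, 0.5198)
test_rmse <- 1.2547
test_rsq  <- 0.5350

stability <- data.frame(
  metric   = c("RMSE", "R_squared"),
  cv_mean  = c(mean(cv_rmse), mean(cv_rsq)),
  cv_sd    = c(sd(cv_rmse), sd(cv_rsq)),
  test_set = c(test_rmse, test_rsq)
)

# Gap between the test-set value and the cross-validated mean, in fold SDs.
stability$gap_in_sd <- (stability$test_set - stability$cv_mean) / stability$cv_sd

stability
```

For both metrics the test-set value lies within one fold standard deviation of the cross-validated mean.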

6 Economic Interpretation

The model results help answer the economic question: which player characteristics best predict professional football players’ wages?

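Age, appearances, and national team caps are important because they measure different types of human capital and reputation. Apps reflects club-level experience, while Caps reflects international recognition. Both have positive and statistically significant coefficients in both models (0.0059 and 0.0126 in Model 2), consistent with the idea that players with more appearances and more national team experience are more visible, more trusted, and more valuable to clubs. The Age coefficient is less stable: it is negative in Model 1 (-0.0311) but positive in Model 2 (0.0464), so the association between age and wages depends on whether position and league are held constant.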

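The inclusion of Position and League is also economically meaningful. Different positions may be rewarded differently depending on market demand, scarcity, and visibility. In Model 2, goalkeepers earn significantly less than defenders, the omitted reference position (coefficient -0.4538), while the Forward coefficient is not statistically significant. League controls matter because football leagues differ in revenue, global audience, broadcasting income, and wage structures. Relative to the omitted reference league, the Premier League coefficient is positive (0.4149) and the Primiera Liga coefficient is strongly negative (-1.3406).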
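
The Age coefficient is negative in Model 1 (-0.0311) and positive in Model 2 (0.0464). One possible explanation, which this report does not test, is that age is correlated with league or position, so that the basic model attributes part of the league effect to age. The simulation below is a minimal sketch of that mechanism. The two leagues, the age distributions, and the effect sizes are all assumed values chosen for illustration, not estimates from SalaryPrediction.csv.

```r
# Simulated illustration only; every number below is an assumption.
set.seed(465)
n <- 500

sim <- data.frame(
  league = rep(c("A", "B"), each = n),
  # League B is assumed to have older players on average.
  age = c(rnorm(n, mean = 24, sd = 2), rnorm(n, mean = 29, sd = 2))
)

# Within each league, log wages rise with age (+0.05 per year),
# but league B pays much less overall (-1.5 log points).
sim$log_wage <- 10 + 0.05 * sim$age - 1.5 * (sim$league == "B") +
  rnorm(2 * n, sd = 0.3)

# Age coefficient without and with the league control.
age_without_league <- coef(lm(log_wage ~ age, data = sim))[["age"]]
age_with_league    <- coef(lm(log_wage ~ age + league, data = sim))[["age"]]

c(without_league = age_without_league, with_league = age_with_league)
```

In this simulated setting the age coefficient is negative when league is omitted and positive once league is included, which mirrors the sign change between the two models in the report. Whether the same mechanism drives the real result would need to be checked, for example by tabulating mean age by league in salary_clean.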

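Because the dependent variable is log wage, a coefficient b corresponds to a change of 100 * (exp(b) - 1) percent in wages for a one-unit increase in the predictor, which is approximately 100 * b percent when b is small. For example, the Model 2 coefficient on Caps (0.0126) implies that one additional national team cap is associated with wages about 1.3 percent higher, holding other variables constant. For larger coefficients, such as those on the league indicators, the approximation is poor and the exact formula should be used.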
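
The conversion from log-wage coefficients to percentage changes can be sketched as follows, using a few Model 2 coefficients copied (rounded) from the regression table. The object names coefs, pct_exact, and pct_approx are illustrative.

```r
# Model 2 coefficients copied (rounded) from the regression table.
coefs <- c(
  "Caps"                 = 0.0126,
  "Apps"                 = 0.0059,
  "PositionGoalkeeper"   = -0.4538,
  "LeaguePremier League" = 0.4149
)

# Exact percentage change in wages for a one-unit change: 100 * (exp(b) - 1).
pct_exact <- 100 * (exp(coefs) - 1)

# Common approximation: 100 * b, accurate only when |b| is small.
pct_approx <- 100 * coefs

round(cbind(pct_approx, pct_exact), 2)
```

For Caps and Apps the two columns are nearly identical. For the goalkeeper and Premier League indicators they differ noticeably: roughly -36 percent rather than -45 percent, and roughly +51 percent rather than +41 percent.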

These findings could inform business decisions by helping clubs identify which observable player characteristics are most closely related to wages. They could also support future research on wage inequality, superstar effects, and whether clubs overpay for reputation relative to actual performance.

7 Limitations and Reproducibility

7.1 Limitations

First, the dataset does not include all possible determinants of football wages. Important factors such as injuries, contract length, transfer fees, agent influence, sponsorship value, and recent performance statistics are not included.

Second, the analysis is observational. The regression results show associations, not necessarily causal effects. For example, players with more caps may earn higher wages, but this does not prove that caps directly cause higher wages.

Third, the dataset is cross-sectional. It captures players at one point in time and does not follow wage changes over multiple seasons.

7.2 Reproducibility

Several steps were taken to make the analysis reproducible:

  • The project uses a relative file path: SalaryPrediction.csv
  • set.seed(465) is used before data splitting and cross-validation
  • The analysis is written in a single Quarto document
  • All data cleaning, visualization, modeling, and evaluation steps are included in the document
  • The code can be re-run using quarto render as long as the CSV file is in the same folder
sessionInfo()
R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=Turkish_Türkiye.utf8  LC_CTYPE=Turkish_Türkiye.utf8   
[3] LC_MONETARY=Turkish_Türkiye.utf8 LC_NUMERIC=C                    
[5] LC_TIME=Turkish_Türkiye.utf8    

time zone: Europe/Istanbul
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] knitr_1.50      broom_1.0.10    yardstick_1.3.2 rsample_1.3.2  
 [5] lubridate_1.9.4 forcats_1.0.1   stringr_1.6.0   dplyr_1.1.4    
 [9] purrr_1.2.1     readr_2.1.6     tidyr_1.3.1     tibble_3.3.0   
[13] ggplot2_4.0.1   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] future_1.70.0      generics_0.1.4     stringi_1.8.7      listenv_0.10.1    
 [5] hms_1.1.4          digest_0.6.38      magrittr_2.0.4     evaluate_1.0.5    
 [9] grid_4.5.2         timechange_0.3.0   RColorBrewer_1.1-3 fastmap_1.2.0     
[13] jsonlite_2.0.0     backports_1.5.0    scales_1.4.0       codetools_0.2-20  
[17] cli_3.6.5          crayon_1.5.3       rlang_1.1.7        parallelly_1.46.1 
[21] bit64_4.6.0-1      withr_3.0.2        yaml_2.3.10        tools_4.5.2       
[25] parallel_4.5.2     tzdb_0.5.0         globals_0.19.1     vctrs_0.7.2       
[29] R6_2.6.1           lifecycle_1.0.4    bit_4.6.0          vroom_1.6.6       
[33] furrr_0.4.0        pkgconfig_2.0.3    pillar_1.11.1      gtable_0.3.6      
[37] glue_1.8.0         xfun_0.54          tidyselect_1.2.1   rstudioapi_0.17.1 
[41] farver_2.1.2       htmltools_0.5.8.1  labeling_0.4.3     rmarkdown_2.31    
[45] compiler_4.5.2     S7_0.2.1          

8 AI Use Log

8.1 Prompt

We asked ChatGPT how to organize a Stage 3 final report using one regression dataset and how to combine probability analysis, modeling results, economic interpretation, limitations, and final reflection in a Quarto document.

8.2 How the Output Was Used

The response was used as a general outline for the report structure. The code and explanations were modified to match the actual dataset, especially the wage outcome variable and the available predictors.

8.3 Verification

The code was checked by rendering the Quarto document, confirming that the dataset loaded correctly, and reviewing the model outputs, tables, and plots.

9 Final Reflections

With more time and better data, we would improve the analysis by adding more detailed performance variables such as goals, assists, minutes played, injury history, contract length, and transfer value. These variables would likely improve prediction accuracy and create a richer economic interpretation.

A new economic question inspired by this analysis is:

Do football players from certain nationalities or leagues earn wage premiums even after controlling for experience and performance?

This question would allow future research to investigate labor market segmentation, reputation effects, and possible wage inequality in professional football.

10 Conclusion

This final report examined the determinants of professional football player wages using a regression-based predictive modeling approach. The analysis showed that wages are highly right-skewed and that a log transformation makes the outcome more suitable for modeling.

The regression models suggest that player experience, international recognition, position, and league are useful predictors of wages: adding position and league to the basic model raised the test-set R-squared from 0.4051 to 0.5350. Economically, this supports the idea that football wages reflect both human capital and market visibility. Although the analysis has limitations, it provides a clear example of how data science methods can be used to study wage determination in a high-value sports labor market.