1 The Dataset
We will be working with the FIFA 22 complete player dataset from Kaggle. Our focus will be on modelling the dependence of a player's wage on attributes like age, reputation, etc., and on predicting the wage of players on out-of-sample data.
We will be using the FIFA player datasets from the years 2018 through 2022.
1.1 Loading Packages & Reading Data
As the first step, we will set a seed. This will ensure reproducibility of the stochastic processes happening in the analysis (like random sampling).
The library kableExtra by Hao Zhu was used to format the table display in HTML.
Code
set.seed(101)
library(dplyr)
library(tidyverse)
library(corrplot)
library(randomForest)
library(VIM)
library(mice)
library(kableExtra)
library(MASS)
library(MLmetrics)
library(xgboost)
library(fastDummies)
get_yearfile <- function(title_year) {
  filename <- paste('./data/players_', title_year, '.csv', sep="")
  df <- read.csv(filename)
  df["fifa_version"] <- title_year
  df
}
df_list <- lapply(c(18:22), get_yearfile)
player_df <- df_list %>% reduce(bind_rows)
rm(df_list)
eligible_features <- c('fifa_version', 'value_eur','wage_eur' ,'age',
'height_cm', 'weight_kg', 'league_name',
'club_contract_valid_until', 'club_joined', 'preferred_foot',
'international_reputation',
'work_rate', 'body_type', 'release_clause_eur')
player_df <- player_df[eligible_features]
head(player_df) %>%
kbl() %>%
kable_material(c("striped", "hover"))| fifa_version | value_eur | wage_eur | age | height_cm | weight_kg | league_name | club_contract_valid_until | club_joined | preferred_foot | international_reputation | work_rate | body_type | release_clause_eur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 9.55e+07 | 575000 | 32 | 185 | 80 | Spain Primera Division | 2021 | 2009-07-01 | Right | 5 | High/Low | Unique | 195800000 |
| 18 | 1.05e+08 | 575000 | 30 | 170 | 72 | Spain Primera Division | 2018 | 2004-07-01 | Left | 5 | Medium/Medium | Unique | 215300000 |
| 18 | 6.10e+07 | 225000 | 31 | 193 | 92 | German 1. Bundesliga | 2021 | 2011-07-01 | Right | 5 | Medium/Medium | Normal (185+) | 100700000 |
| 18 | 9.70e+07 | 500000 | 30 | 182 | 86 | Spain Primera Division | 2021 | 2014-07-11 | Right | 5 | High/Medium | Normal (170-185) | 198900000 |
| 18 | 1.23e+08 | 275000 | 25 | 175 | 68 | French Ligue 1 | 2022 | 2017-08-03 | Right | 5 | High/Medium | Unique | 236800000 |
| 18 | 9.20e+07 | 350000 | 28 | 185 | 79 | German 1. Bundesliga | 2021 | 2014-07-01 | Right | 4 | High/Medium | Normal (170-185) | 151800000 |
1.2 Data Cleaning
The dataframe has 92705 rows, comprising players from 57 leagues. We will keep only the top 7 leagues according to UEFA rankings:
- Premier League | England
- La Liga | Spain
- Serie A | Italy
- Bundesliga | Germany
- Primeira Liga | Portugal
- Ligue 1 | France
- Eredivisie | Netherlands
Code
top_leagues <- c("English Premier League", "Spain Primera Division", "Italian Serie A",
"German 1. Bundesliga", "Portuguese Liga ZON SAGRES", "French Ligue 1",
"Holland Eredivisie")
player_df <- player_df[player_df$league_name %in% top_leagues,]
2 Exploratory Data Analysis
2.1 Missing Data Analysis
We can find out the columns which have missing values, with the code snippet below:
miss_cols <- names(which(colSums(is.na(player_df)) > 0))
miss_cols[1] "value_eur" "release_clause_eur"
The following columns have missing values: value_eur, release_clause_eur
Let’s check the count of missing values compared to the total number of rows, in each version of FIFA dataset:
player_df %>%
group_by(fifa_version) %>%
dplyr::summarise(release_clause_na = sum(is.na(release_clause_eur)),
value_na = sum(is.na(value_eur)), row_count=n())# A tibble: 5 x 4
fifa_version release_clause_na value_na row_count
<int> <int> <int> <int>
1 18 286 1 3949
2 19 220 1 3959
3 20 203 2 4018
4 21 166 2 4068
5 22 302 0 3973
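To put these counts in perspective, we can convert them to percentages with the same dplyr pattern (a small sketch; per the counts above, release_clause_eur is missing for roughly 4-8% of players in each version):
Code
player_df %>%
  group_by(fifa_version) %>%
  dplyr::summarise(pct_release_clause_na = 100 * mean(is.na(release_clause_eur)),
                   pct_value_na = 100 * mean(is.na(value_eur)))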
We can visualize the missing values in the following plot:
aggr_plot <- aggr(player_df, col=c('#0099ff','#ff6699'), numbers=TRUE, sortVars=TRUE,
labels=names(player_df), cex.axis=.5, gap=3,
ylab=c("Histogram of missing data","Pattern"))We will now impute the missing values using the mice package in R
imputed_data <- mice(player_df, m=3, maxit=10, meth='cart', seed=500)
Let's compare the distribution of the imputed data and the original data:
densityplot(imputed_data)
Figure 3 shows the distribution of the imputed data (red) over the distribution of the original data (blue). The two follow similar distributions, so imputing the missing values has not distorted the data distribution.
So we replace the missing values with the imputed values:
player_df <- complete(imputed_data, 1)
3 Feature Engineering
3.1 Creating New Features
Based on the FIFA version, the club_joined date and the club_contract_valid_until date, we can create two new features: contract_left_days and days_at_club.
Code
player_df$fifa_version <- paste('20', player_df$fifa_version, sep="")
player_df$fifa_version <- as.Date(ISOdate(player_df$fifa_version, 1, 1))
player_df <- player_df[player_df$club_joined!="",] # since some blank values were found
player_df$club_joined <- as.Date(player_df$club_joined) # already a full date string
player_df$club_contract_valid_until <- as.Date(ISOdate(player_df$club_contract_valid_until, 1, 1))
player_df$contract_left_days <- as.numeric(player_df$club_contract_valid_until - player_df$fifa_version)
player_df$days_at_club <- as.numeric(player_df$fifa_version - player_df$club_joined)
head(player_df[,c("fifa_version", "club_joined", "club_contract_valid_until", "contract_left_days", "days_at_club")]) %>%
kbl() %>%
kable_material(c("striped", "hover"))| fifa_version | club_joined | club_contract_valid_until | contract_left_days | days_at_club |
|---|---|---|---|---|
| 2018-01-01 | 2009-07-01 | 2021-01-01 | 1096 | 3106 |
| 2018-01-01 | 2004-07-01 | 2018-01-01 | 0 | 4932 |
| 2018-01-01 | 2011-07-01 | 2021-01-01 | 1096 | 2376 |
| 2018-01-01 | 2014-07-11 | 2021-01-01 | 1096 | 1270 |
| 2018-01-01 | 2017-08-03 | 2022-01-01 | 1461 | 151 |
| 2018-01-01 | 2014-07-01 | 2021-01-01 | 1096 | 1280 |
Code
player_df <- subset(player_df, select = -c(fifa_version, club_contract_valid_until, club_joined))
3.2 Transforming Features
The body_type and work_rate features contain composite string values (e.g. "High/Low" or "Normal (185+)"). Extracting the first word generates cleaner categories:
player_df$work_rate <- str_extract(player_df$work_rate, "\\w+")
player_df$body_type <- str_extract(player_df$body_type, "\\w+")
The features of the data now have the following types:
Code
str(player_df)
'data.frame': 18796 obs. of 13 variables:
$ value_eur : num 9.55e+07 1.05e+08 6.10e+07 9.70e+07 1.23e+08 9.20e+07 5.20e+07 7.70e+07 7.90e+07 9.05e+07 ...
$ wage_eur : num 575000 575000 225000 500000 275000 350000 300000 275000 325000 300000 ...
$ age : int 32 30 31 30 25 28 31 29 27 26 ...
$ height_cm : int 185 170 193 182 175 185 183 184 182 173 ...
$ weight_kg : int 80 72 92 86 68 79 75 87 78 76 ...
$ league_name : chr "Spain Primera Division" "Spain Primera Division" "German 1. Bundesliga" "Spain Primera Division" ...
$ preferred_foot : chr "Right" "Left" "Right" "Right" ...
$ international_reputation: int 5 5 5 5 5 4 4 4 4 4 ...
$ work_rate : chr "High" "Medium" "Medium" "High" ...
$ body_type : chr "Unique" "Unique" "Normal" "Normal" ...
$ release_clause_eur : num 1.96e+08 2.15e+08 1.01e+08 1.99e+08 2.37e+08 ...
$ contract_left_days : num 1096 0 1096 1096 1461 ...
$ days_at_club : num 3106 4932 2376 1270 151 ...
The features league_name, preferred_foot, work_rate, body_type and international_reputation can be converted to factors (categorical type) to facilitate model fitting.
Code
player_df$league_name <- as.factor(player_df$league_name)
player_df$preferred_foot <- as.factor(player_df$preferred_foot)
player_df$work_rate <- as.factor(player_df$work_rate)
player_df$body_type <- as.factor(player_df$body_type)
player_df$international_reputation <- as.factor(player_df$international_reputation)
3.3 Train-Test Split
Now the data is ready for splitting into training (including validation) and testing datasets. The test dataset will be utilized later, after designing the models, for predicting the wage on out-of-sample data.
We will select 80% of the data for training, and keep 20% as unseen data for prediction.
Code
sample <- sample.int(n = nrow(player_df), size = floor(.80*nrow(player_df)), replace = F)
train <- player_df[sample, ]
test <- player_df[-sample, ]
player_df <- train # renaming the dataframe
4 Fitting Regression Models
Now we will model the dependence of Wage on the other covariates. The models will later be used to make predictions on the unseen data.
4.1 Linear Model
model.lm1 <- lm(wage_eur ~ ., data=player_df)
model.lm1.summ <- summary(model.lm1)
model.lm1.summ
Call:
lm(formula = wage_eur ~ ., data = player_df)
Residuals:
Min 1Q Median 3Q Max
-245650 -5724 245 4662 239997
Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)                            3.358e+04  5.074e+03   6.618 3.76e-11 ***
value_eur                              6.064e-04  5.247e-05  11.558  < 2e-16 ***
age                                    8.992e+02  4.068e+01  22.106  < 2e-16 ***
height_cm                             -2.098e+02  3.783e+01  -5.545 2.99e-08 ***
weight_kg                              8.909e+01  3.729e+01   2.389  0.01691 *  
league_nameFrench Ligue 1             -2.036e+04  5.156e+02 -39.482  < 2e-16 ***
league_nameGerman 1. Bundesliga       -1.924e+04  5.262e+02 -36.563  < 2e-16 ***
league_nameHolland Eredivisie         -2.437e+04  5.734e+02 -42.501  < 2e-16 ***
league_nameItalian Serie A            -1.546e+04  5.243e+02 -29.490  < 2e-16 ***
league_namePortuguese Liga ZON SAGRES -2.914e+04  5.580e+02 -52.228  < 2e-16 ***
league_nameSpain Primera Division     -1.959e+04  5.148e+02 -38.061  < 2e-16 ***
preferred_footRight                    9.119e+01  3.310e+02   0.275  0.78296    
international_reputation2              7.478e+03  4.528e+02  16.515  < 2e-16 ***
international_reputation3              2.394e+04  7.522e+02  31.824  < 2e-16 ***
international_reputation4              8.325e+04  1.555e+03  53.552  < 2e-16 ***
international_reputation5              1.511e+05  3.942e+03  38.330  < 2e-16 ***
work_rateLow                          -7.862e+02  7.879e+02  -0.998  0.31837    
work_rateMedium                       -1.011e+03  3.361e+02  -3.009  0.00263 ** 
body_typeNormal                       -2.519e+02  3.325e+02  -0.758  0.44872    
body_typeStocky                       -9.086e+02  7.346e+02  -1.237  0.21616    
body_typeUnique                       -6.258e+03  1.489e+03  -4.204 2.64e-05 ***
release_clause_eur                     6.297e-04  2.981e-05  21.122  < 2e-16 ***
contract_left_days                     3.008e+00  3.665e-01   8.206 2.47e-16 ***
days_at_club                           1.964e+00  1.984e-01   9.895  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17620 on 15012 degrees of freedom
Multiple R-squared: 0.7947, Adjusted R-squared: 0.7944
F-statistic: 2527 on 23 and 15012 DF, p-value: < 2.2e-16
We see that the features weight_kg, preferred_foot, work_rate and body_type have low significance in modelling the wage.
So, let's try a linear model without these factors:
model.lm2 <- lm(wage_eur ~ value_eur + age + height_cm + league_name +
international_reputation + release_clause_eur +
contract_left_days + days_at_club, data = player_df)
model.lm2.summ <- summary(model.lm2)
model.lm2.summ
Call:
lm(formula = wage_eur ~ value_eur + age + height_cm + league_name +
international_reputation + release_clause_eur + contract_left_days +
days_at_club, data = player_df)
Residuals:
Min 1Q Median 3Q Max
-247655 -5686 213 4625 239475
Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)                            2.941e+04  4.055e+03   7.252 4.32e-13 ***
value_eur                              5.426e-04  4.995e-05  10.862  < 2e-16 ***
age                                    9.239e+02  3.947e+01  23.406  < 2e-16 ***
height_cm                             -1.581e+02  2.214e+01  -7.139 9.85e-13 ***
league_nameFrench Ligue 1             -2.039e+04  5.153e+02 -39.562  < 2e-16 ***
league_nameGerman 1. Bundesliga       -1.910e+04  5.239e+02 -36.456  < 2e-16 ***
league_nameHolland Eredivisie         -2.446e+04  5.697e+02 -42.932  < 2e-16 ***
league_nameItalian Serie A            -1.543e+04  5.237e+02 -29.473  < 2e-16 ***
league_namePortuguese Liga ZON SAGRES -2.924e+04  5.555e+02 -52.645  < 2e-16 ***
league_nameSpain Primera Division     -1.969e+04  5.131e+02 -38.381  < 2e-16 ***
international_reputation2              7.662e+03  4.517e+02  16.962  < 2e-16 ***
international_reputation3              2.388e+04  7.520e+02  31.756  < 2e-16 ***
international_reputation4              8.285e+04  1.551e+03  53.404  < 2e-16 ***
international_reputation5              1.484e+05  3.881e+03  38.237  < 2e-16 ***
release_clause_eur                     6.558e-04  2.927e-05  22.404  < 2e-16 ***
contract_left_days                     3.194e+00  3.651e-01   8.747  < 2e-16 ***
days_at_club                           1.969e+00  1.984e-01   9.922  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17640 on 15019 degrees of freedom
Multiple R-squared: 0.7943, Adjusted R-squared: 0.7941
F-statistic: 3624 on 16 and 15019 DF, p-value: < 2.2e-16
We see that the F-statistic has improved from 2527 to 3624, while the adjusted R-squared dropped only marginally, from 0.7944 to 0.7941. The simpler model is therefore preferable.
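As a further check, the two models can be compared directly: model.lm2 is nested in model.lm1 (it simply drops weight_kg, preferred_foot, work_rate and body_type), so a standard nested-model F-test applies. A minimal sketch (output not shown):
Code
# Tests whether the dropped terms jointly improve model.lm1 over model.lm2
anova(model.lm2, model.lm1)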
4.2 Generalized Linear Models
Since we are working with currency data (wage, value and release_clause), we cannot assume the covariates are normally distributed: they are non-negative and concentrated at low values. Let us plot the univariate distributions:
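For example, the density of wage_eur can be drawn as follows (a minimal sketch, using the same stat_density style as the Gamma plot in Section 4.2.2):
Code
# Density of wage_eur, illustrating the heavy right skew
ggplot(player_df, aes(x=wage_eur)) +
  stat_density(colour="#ff6699", geom="line", position="identity", size=1) +
  labs(title="Distribution of Wage", x="Wage (EUR)", y="Density")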
We can see that the currency variables (wage_eur, value_eur and release_clause_eur) are highly positively skewed, and far from normally distributed. Thus we should view them on a log scale. The distributions after log transformation are shown below:
4.2.1 Log transformed variables
player_df$wage_log <- log(player_df$wage_eur)
player_df$value_log <- log(player_df$value_eur)
player_df$release_clause_log <- log(player_df$release_clause_eur)
We can see that the positive skewness has been damped, and the log-transformed variables are approximately Gaussian.
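We can verify this numerically with a hand-rolled sample-skewness helper (skewness below is our own helper, not from a package):
Code
# Sample skewness: near 0 for a roughly Gaussian variable
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(player_df$wage_eur)  # strongly positive
skewness(player_df$wage_log)  # much closer to zero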
4.2.2 Gamma Distributed Response
If we log-transform the response variable (Wage), we will no longer be able to compare the model to the previous linear models, where the untransformed Wage was used. It would also introduce extra bias due to the nonlinear back-transformation with the exponential function. Thus, we keep the response variable in its original form and instead fit a GLM with a family corresponding to its distribution.
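The back-transformation bias is easy to demonstrate with a small simulation (a sketch assuming lognormal noise; not part of the original analysis): exponentiating the mean of the logs recovers the median, not the mean, and so systematically underestimates it.
Code
set.seed(101)
y <- exp(rnorm(1e5, mean = 10, sd = 1))  # lognormal sample
exp(mean(log(y)))  # naive back-transform: ~ exp(10)       ~= 22026
mean(y)            # actual mean:          ~ exp(10 + 1/2) ~= 36316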
Code
k_wage <- player_df$wage_eur/1000
ggplot(player_df, aes(x=k_wage)) +
stat_density(colour="#ff6699", geom="line", position="identity", size=1) +
stat_function(color="#0099ff", fun=dgamma, args=list(shape=mean(k_wage)^2 / sd(k_wage)^2, scale=sd(k_wage)^2 / mean(k_wage)), size=1) +
labs(title="Gamma distribution of Wage",
x ="Wage (in 1000 Eur)", y = "Density")Thus, we will fit a GLM with Gamma family, and with a log-link.
Code
model.glm1 <- glm(wage_eur ~ value_log + age + height_cm + league_name +
international_reputation + release_clause_log +
contract_left_days + days_at_club, data = player_df,
family = Gamma(link="log"))
model.glm1
Call: glm(formula = wage_eur ~ value_log + age + height_cm + league_name +
international_reputation + release_clause_log + contract_left_days +
days_at_club, family = Gamma(link = "log"), data = player_df)
Coefficients:
(Intercept) value_log
2.653e-01 2.161e-01
age height_cm
6.967e-02 -4.365e-03
league_nameFrench Ligue 1 league_nameGerman 1. Bundesliga
-6.132e-01 -5.229e-01
league_nameHolland Eredivisie league_nameItalian Serie A
-1.239e+00 -4.113e-01
league_namePortuguese Liga ZON SAGRES league_nameSpain Primera Division
-1.484e+00 -6.742e-01
international_reputation2 international_reputation3
3.867e-02 1.404e-01
international_reputation4 international_reputation5
4.545e-01 6.118e-01
release_clause_log contract_left_days
3.651e-01 6.394e-05
days_at_club
4.784e-05
Degrees of Freedom: 15035 Total (i.e. Null); 15019 Residual
Null Deviance: 22430
Residual Deviance: 2562 AIC: 300900
If we compare the AIC of the GLM with the previous Linear Model, we see:
- AIC of GLM (with Gamma family and log link) : 300907
- AIC of Linear Model 2 : 336730.2
Thus, the GLM has given a better result (lower AIC).
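Both values can be reproduced with the AIC() function, which works on lm as well as glm objects (a quick sketch; numbers as quoted above):
Code
AIC(model.glm1)  # ~ 300907
AIC(model.lm2)   # ~ 336730.2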
4.2.3 AIC-based step function for variable selection
We can check whether the variables we omitted at the linear-model stage should be added back to the GLM to reduce the AIC even further. We start with only value_log as a predictor, and keep adding / removing covariates that reduce the AIC.
Code
model.glm1.step <- step(glm(wage_eur ~ value_log, data=player_df, family = Gamma(link="log")),
scope = wage_eur ~ value_log + age + weight_kg + height_cm +
days_at_club + contract_left_days + preferred_foot +
international_reputation + work_rate + body_type + release_clause_log +
league_name,
direction = "both") Start: AIC=317023.6
wage_eur ~ value_log
Df Deviance AIC
+ league_name 6 4142.2 311766
+ age 1 5725.2 314546
+ international_reputation 4 6394.8 315733
+ days_at_club 1 6710.5 316284
+ contract_left_days 1 6912.4 316639
+ weight_kg 1 6977.4 316754
+ release_clause_log 1 7010.9 316813
+ height_cm 1 7096.9 316965
+ body_type 3 7113.1 316997
+ work_rate 2 7116.4 317001
+ preferred_foot 1 7129.4 317022
<none> 7131.5 317024
- value_log 1 22426.9 343985
Step: AIC=308375.2
wage_eur ~ value_log + league_name
Df Deviance AIC
+ age 1 2712.7 303756
+ international_reputation 4 3692.8 306930
+ contract_left_days 1 3927.5 307683
+ days_at_club 1 3943.1 307734
+ weight_kg 1 4061.6 308117
+ body_type 3 4103.2 308255
+ release_clause_log 1 4132.4 308345
+ height_cm 1 4134.2 308351
+ work_rate 2 4136.2 308360
+ preferred_foot 1 4141.1 308374
<none> 4142.2 308375
- league_name 6 7131.5 318027
- value_log 1 14989.6 343442
Step: AIC=301775.6
wage_eur ~ value_log + league_name + age
Df Deviance AIC
+ release_clause_log 1 2651.0 301468
+ international_reputation 4 2655.9 301498
+ days_at_club 1 2692.9 301678
+ height_cm 1 2700.6 301717
+ weight_kg 1 2707.9 301754
+ work_rate 2 2707.9 301755
+ body_type 3 2707.6 301756
+ contract_left_days 1 2710.0 301764
+ preferred_foot 1 2712.0 301774
<none> 2712.7 301776
- age 1 4142.2 308954
- league_name 6 5725.2 316895
- value_log 1 12195.5 349404
Step: AIC=301421.4
wage_eur ~ value_log + league_name + age + release_clause_log
Df Deviance AIC
+ international_reputation 4 2593.8 301136
+ days_at_club 1 2628.2 301307
+ height_cm 1 2639.9 301367
+ body_type 3 2643.6 301389
+ work_rate 2 2646.0 301400
+ weight_kg 1 2646.9 301402
+ contract_left_days 1 2649.7 301417
+ preferred_foot 1 2650.1 301419
<none> 2651.0 301421
- value_log 1 2681.4 301576
- release_clause_log 1 2712.7 301736
- age 1 4132.4 309023
- league_name 6 5485.3 315957
Step: AIC=301092.2
wage_eur ~ value_log + league_name + age + release_clause_log +
international_reputation
Df Deviance AIC
+ days_at_club 1 2581.1 301026
+ height_cm 1 2582.9 301036
+ weight_kg 1 2589.2 301069
+ work_rate 2 2589.1 301071
+ contract_left_days 1 2591.1 301080
+ body_type 3 2591.5 301086
+ preferred_foot 1 2593.0 301090
<none> 2593.8 301092
- value_log 1 2618.9 301224
- international_reputation 4 2651.0 301389
- release_clause_log 1 2655.9 301421
- age 1 3674.9 306858
- league_name 6 5331.1 315685
Step: AIC=301017.8
wage_eur ~ value_log + league_name + age + release_clause_log +
international_reputation + days_at_club
Df Deviance AIC
+ height_cm 1 2570.5 300963
+ contract_left_days 1 2573.7 300980
+ work_rate 2 2575.6 300993
+ weight_kg 1 2576.7 300996
+ body_type 3 2578.8 301012
+ preferred_foot 1 2580.1 301015
<none> 2581.1 301018
- days_at_club 1 2593.8 301084
- value_log 1 2604.4 301141
- international_reputation 4 2628.2 301263
- release_clause_log 1 2645.4 301361
- age 1 3577.6 306368
- league_name 6 5221.2 315186
Step: AIC=300956.3
wage_eur ~ value_log + league_name + age + release_clause_log +
international_reputation + days_at_club + height_cm
Df Deviance AIC
+ contract_left_days 1 2562.0 300912
+ work_rate 2 2568.2 300948
+ body_type 3 2568.3 300951
+ preferred_foot 1 2570.0 300956
+ weight_kg 1 2570.0 300956
<none> 2570.5 300956
- height_cm 1 2581.1 301011
- days_at_club 1 2582.9 301021
- value_log 1 2594.3 301083
- international_reputation 4 2617.6 301202
- release_clause_log 1 2633.7 301295
- age 1 3575.0 306366
- league_name 6 5221.2 315226
Step: AIC=300907
wage_eur ~ value_log + league_name + age + release_clause_log +
international_reputation + days_at_club + height_cm + contract_left_days
Df Deviance AIC
+ work_rate 2 2559.9 300900
+ body_type 3 2559.7 300901
+ preferred_foot 1 2561.5 300907
+ weight_kg 1 2561.6 300907
<none> 2562.0 300907
- contract_left_days 1 2570.5 300951
- height_cm 1 2573.7 300969
- days_at_club 1 2579.4 300999
- value_log 1 2585.2 301031
- international_reputation 4 2610.0 301159
- release_clause_log 1 2622.4 301232
- age 1 3515.4 306058
- league_name 6 5205.3 315181
Step: AIC=300898.4
wage_eur ~ value_log + league_name + age + release_clause_log +
international_reputation + days_at_club + height_cm + contract_left_days +
work_rate
Df Deviance AIC
+ body_type 3 2557.7 300893
+ weight_kg 1 2559.4 300898
<none> 2559.9 300898
+ preferred_foot 1 2559.5 300899
- work_rate 2 2562.0 300906
- contract_left_days 1 2568.2 300942
- height_cm 1 2568.3 300942
- days_at_club 1 2577.9 300994
- value_log 1 2582.5 301019
- international_reputation 4 2607.6 301149
- release_clause_log 1 2620.6 301225
- age 1 3510.4 306039
- league_name 6 5190.7 315121
Step: AIC=300891.3
wage_eur ~ value_log + league_name + age + release_clause_log +
international_reputation + days_at_club + height_cm + contract_left_days +
work_rate + body_type
Df Deviance AIC
+ weight_kg 1 2556.3 300886
<none> 2557.7 300891
+ preferred_foot 1 2557.4 300892
- body_type 3 2559.9 300897
- work_rate 2 2559.7 300898
- contract_left_days 1 2566.1 300935
- height_cm 1 2566.1 300935
- days_at_club 1 2575.7 300987
- value_log 1 2580.3 301011
- international_reputation 4 2601.4 301120
- release_clause_log 1 2617.5 301213
- age 1 3495.2 305970
- league_name 6 5155.1 314957
Step: AIC=300885
wage_eur ~ value_log + league_name + age + release_clause_log +
international_reputation + days_at_club + height_cm + contract_left_days +
work_rate + body_type + weight_kg
Df Deviance AIC
<none> 2556.3 300885
+ preferred_foot 1 2556.0 300885
- weight_kg 1 2557.7 300890
- work_rate 2 2558.4 300892
- body_type 3 2559.4 300896
- height_cm 1 2563.6 300923
- contract_left_days 1 2564.7 300928
- days_at_club 1 2574.3 300981
- value_log 1 2578.7 301005
- international_reputation 4 2599.3 301110
- release_clause_log 1 2616.3 301208
- age 1 3457.1 305770
- league_name 6 5132.2 314847
In the final model, preferred_foot is the only omitted variable; all other features are included. Thus we form the next GLM:
Code
model.glm2 <- glm(wage_eur ~ value_log + age + height_cm + league_name +
international_reputation + release_clause_log +
contract_left_days + days_at_club + weight_kg +
work_rate + body_type, data = player_df,
family = Gamma(link="log"))
model.glm2
Call: glm(formula = wage_eur ~ value_log + age + height_cm + league_name +
international_reputation + release_clause_log + contract_left_days +
days_at_club + weight_kg + work_rate + body_type, family = Gamma(link = "log"),
data = player_df)
Coefficients:
(Intercept) value_log
4.161e-01 2.137e-01
age height_cm
6.968e-02 -5.831e-03
league_nameFrench Ligue 1 league_nameGerman 1. Bundesliga
-6.110e-01 -5.223e-01
league_nameHolland Eredivisie league_nameItalian Serie A
-1.232e+00 -4.096e-01
league_namePortuguese Liga ZON SAGRES league_nameSpain Primera Division
-1.477e+00 -6.686e-01
international_reputation2 international_reputation3
3.760e-02 1.412e-01
international_reputation4 international_reputation5
4.504e-01 6.046e-01
release_clause_log contract_left_days
3.651e-01 6.326e-05
days_at_club weight_kg
4.859e-05 2.469e-03
work_rateLow work_rateMedium
-4.343e-02 -2.540e-02
body_typeNormal body_typeStocky
-3.286e-02 -1.957e-02
body_typeUnique
-2.678e-02
Degrees of Freedom: 15035 Total (i.e. Null); 15013 Residual
Null Deviance: 22430
Residual Deviance: 2556 AIC: 300900
4.3 Random Forest Model
In the next stage, we will try to fit a random forest regression model for Wage analysis. The code is as follows:
model.rf1 <- randomForest(wage_eur ~ value_log + age + height_cm + league_name +
international_reputation + release_clause_log +
contract_left_days + days_at_club, data=player_df, ntree=100,
keep.forest=TRUE, importance=TRUE)
model.rf1
Call:
randomForest(formula = wage_eur ~ value_log + age + height_cm + league_name + international_reputation + release_clause_log + contract_left_days + days_at_club, data = player_df, ntree = 100, keep.forest = TRUE, importance = TRUE)
Type of random forest: regression
Number of trees: 100
No. of variables tried at each split: 2
Mean of squared residuals: 199624051
% Var explained: 86.79
4.3.1 Hyperparameter Optimization of Random Forest Model
One main hyperparameter we can tune in the Random Forest model is mtry: the number of variables randomly sampled as split candidates at each node. For regression, it defaults to p/3 (rounded down), where p is the number of features used in the model; with our 8 predictors that gives floor(8/3) = 2, matching the "No. of variables tried at each split: 2" in the output above.
The best value of mtry can be found by finding the mtry corresponding to the minimum of OOB Error, using the code below:
bestmtry <- tuneRF(player_df,player_df$wage_eur,stepFactor = 1.2,
improve = 0.01, trace=T, plot= T)
mtry = 5  OOB error = 3196052
Searching left ...
Searching right ...
mtry = 6 OOB error = 1781462
0.4426054 0.01
mtry = 7 OOB error = 1596746
0.1036878 0.01
mtry = 8 OOB error = 1510160
0.05422675 0.01
mtry = 9 OOB error = 2068916
-0.3699981 0.01
From the OOB errors above, values of mtry around 7-8 perform best. Using mtry = 7 in the improved Random Forest model:
model.rf2 <- randomForest(wage_eur ~ value_log + age + height_cm + league_name +
international_reputation + release_clause_log +
contract_left_days + days_at_club, data=player_df, ntree=100,
mtry=7, keep.forest=TRUE, importance=TRUE)
model.rf2
Call:
randomForest(formula = wage_eur ~ value_log + age + height_cm + league_name + international_reputation + release_clause_log + contract_left_days + days_at_club, data = player_df, ntree = 100, mtry = 7, keep.forest = TRUE, importance = TRUE)
Type of random forest: regression
Number of trees: 100
No. of variables tried at each split: 7
Mean of squared residuals: 194559801
% Var explained: 87.12
4.3.2 Visualizing variable importance
The importance of variables used in the random forest model is visualized below:
Code
ImpData <- as.data.frame(importance(model.rf2))
ImpData$Var.Names <- row.names(ImpData)
ggplot(ImpData, aes(x=Var.Names, y=`%IncMSE`)) +
geom_segment( aes(x=Var.Names, xend=Var.Names, y=0, yend=`%IncMSE`), color="skyblue") +
geom_point(aes(size = IncNodePurity), color="blue", alpha=0.6) +
theme_light() +
coord_flip() +
theme(
legend.position="bottom",
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()
)
Here we see two different metrics for assessing variable importance:
- %IncMSE : Increase in MSE of predictions when the variable is permuted
- IncNodePurity : Average increase in node purity on splitting using the variable
%IncMSE is the more relevant metric here. We see that league_name plays the most important role: permuting (shuffling) league_name produces the largest increase in prediction error, followed by international_reputation.
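To read the exact numbers behind the plot, the importance matrix can be sorted by %IncMSE (a small sketch):
Code
imp <- importance(model.rf2)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]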
4.4 XGBoost Model
Next we use the XGBoost (eXtreme Gradient Boosting) algorithm to design a model for Wage prediction. We will tune the max.depth and nrounds parameters of the gradient boosting algorithm using grid search, evaluating the MAPE on a validation dataset. The model with the best parameter choices is selected for evaluation on the test set later.
We convert the categorical variables to dummy variables since the XGBoost model requires a data matrix with numerical features. This is achieved with the following code:
dummy_player_df <- dummy_cols(player_df, select_columns = c('league_name', 'international_reputation', 'work_rate', 'body_type', 'preferred_foot'),
remove_selected_columns = TRUE)
Now we keep aside a portion of the data for validation, and generate data matrices for XGBoost:
Code
sample <- sample.int(n = nrow(dummy_player_df), size = floor(.80*nrow(dummy_player_df)), replace = F)
train_xgb <- dummy_player_df[sample, ]
val_xgb <- dummy_player_df[-sample, ]
train_xgb_x <- data.matrix(train_xgb[, -which(names(train_xgb) %in% c('value_eur', 'wage_eur', 'wage_log', 'release_clause_eur'))])
train_xgb_y <- train_xgb$wage_eur
val_xgb_x <- data.matrix(val_xgb[, -which(names(val_xgb) %in% c('value_eur', 'wage_eur', 'wage_log', 'release_clause_eur'))])
val_xgb_y <- val_xgb$wage_eur
xgb_train = xgb.DMatrix(data = train_xgb_x, label = train_xgb_y)
xgb_val = xgb.DMatrix(data = val_xgb_x, label = val_xgb_y)
To tune the max.depth and nrounds parameters of the XGBoost model, we create a function, grid_search_xgb_val, to display the MAPE for a particular (max.depth, nrounds) input:
grid_search_xgb_val <- function(depth, rounds){
model_xgboost = xgboost(data = xgb_train, max.depth = depth, nrounds = rounds, verbose = 0)
pred_y = predict(model_xgboost, xgb_val)
cat(sprintf("Depth:%d Rounds:%d MAPE:%f\n", depth, rounds, MAPE(pred_y, val_xgb_y)*100))
}
Performing a grid search over various values of max.depth and nrounds:
for (depth in seq(3,12,by=1)) {
for (rounds in seq(40,160, by=20)) {
grid_search_xgb_val(depth,rounds)
}
}
Depth:3 Rounds:40 MAPE:44.725384
Depth:3 Rounds:60 MAPE:44.158916
Depth:3 Rounds:80 MAPE:42.547328
Depth:3 Rounds:100 MAPE:42.157512
Depth:3 Rounds:120 MAPE:41.716177
Depth:3 Rounds:140 MAPE:41.426938
Depth:3 Rounds:160 MAPE:40.945738
Depth:4 Rounds:40 MAPE:40.024120
Depth:4 Rounds:60 MAPE:39.628906
Depth:4 Rounds:80 MAPE:38.998049
Depth:4 Rounds:100 MAPE:38.388543
Depth:4 Rounds:120 MAPE:37.999003
Depth:4 Rounds:140 MAPE:37.681227
Depth:4 Rounds:160 MAPE:37.729766
Depth:5 Rounds:40 MAPE:37.030168
Depth:5 Rounds:60 MAPE:36.148724
Depth:5 Rounds:80 MAPE:36.096646
Depth:5 Rounds:100 MAPE:36.159649
Depth:5 Rounds:120 MAPE:36.258739
Depth:5 Rounds:140 MAPE:36.374767
Depth:5 Rounds:160 MAPE:36.519599
Depth:6 Rounds:40 MAPE:35.630111
Depth:6 Rounds:60 MAPE:35.313176
Depth:6 Rounds:80 MAPE:35.253351
Depth:6 Rounds:100 MAPE:35.497423
Depth:6 Rounds:120 MAPE:35.896021
Depth:6 Rounds:140 MAPE:36.189441
Depth:6 Rounds:160 MAPE:36.856739
Depth:7 Rounds:40 MAPE:36.717486
Depth:7 Rounds:60 MAPE:36.731667
Depth:7 Rounds:80 MAPE:36.616669
Depth:7 Rounds:100 MAPE:36.937940
Depth:7 Rounds:120 MAPE:37.427555
Depth:7 Rounds:140 MAPE:38.097361
Depth:7 Rounds:160 MAPE:38.459050
Depth:8 Rounds:40 MAPE:35.905908
Depth:8 Rounds:60 MAPE:36.516768
Depth:8 Rounds:80 MAPE:37.060708
Depth:8 Rounds:100 MAPE:37.615853
Depth:8 Rounds:120 MAPE:37.817680
Depth:8 Rounds:140 MAPE:38.041971
Depth:8 Rounds:160 MAPE:38.238173
Depth:9 Rounds:40 MAPE:35.088213
Depth:9 Rounds:60 MAPE:35.760802
Depth:9 Rounds:80 MAPE:36.669496
Depth:9 Rounds:100 MAPE:37.622907
Depth:9 Rounds:120 MAPE:37.833085
Depth:9 Rounds:140 MAPE:38.113635
Depth:9 Rounds:160 MAPE:38.270711
Depth:10 Rounds:40 MAPE:35.678317
Depth:10 Rounds:60 MAPE:36.171896
Depth:10 Rounds:80 MAPE:36.548702
Depth:10 Rounds:100 MAPE:37.011470
Depth:10 Rounds:120 MAPE:37.467854
Depth:10 Rounds:140 MAPE:37.566243
Depth:10 Rounds:160 MAPE:37.759631
Depth:11 Rounds:40 MAPE:35.394137
Depth:11 Rounds:60 MAPE:36.058420
Depth:11 Rounds:80 MAPE:36.324739
Depth:11 Rounds:100 MAPE:36.643077
Depth:11 Rounds:120 MAPE:36.726062
Depth:11 Rounds:140 MAPE:36.893786
Depth:11 Rounds:160 MAPE:37.019680
Depth:12 Rounds:40 MAPE:36.311382
Depth:12 Rounds:60 MAPE:36.601673
Depth:12 Rounds:80 MAPE:36.968125
Depth:12 Rounds:100 MAPE:37.301814
Depth:12 Rounds:120 MAPE:37.420180
Depth:12 Rounds:140 MAPE:37.448679
Depth:12 Rounds:160 MAPE:37.472590
From the console output, we see that the lowest MAPE values occur at small nrounds with moderate depths. We create the final XGBoost model with max.depth = 7 and nrounds = 40:
model.xgboost = xgboost(data = xgb_train, max.depth = 7, nrounds = 40, verbose = 0)
5 Wage Prediction
Using the models we have created, we will predict the Wage on unseen data (not used during model fitting) and compare the MAPE (Mean Absolute Percentage Error).
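MAPE is computed here with MLmetrics::MAPE, which returns a fraction (hence the *100 throughout). For reference, a minimal hand-rolled equivalent (mape_pct is our own helper, not part of the original code):
Code
# Mean Absolute Percentage Error, as a percentage
mape_pct <- function(pred, actual) 100 * mean(abs((actual - pred) / actual))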
Code
test$wage_log <- log(test$wage_eur)
test$value_log <- log(test$value_eur)
test$release_clause_log <- log(test$release_clause_eur)
5.1 Baseline Model Prediction
To get a minimum performance level (baseline) for wage prediction, we can use the mean wage from the training dataset as the prediction for every player in the test data.
The mean wage in the training data is 2.6971 × 10^4 EUR (about €26,971). We can calculate the MAPE using the following code:
MAPE(mean(player_df$wage_eur), test$wage_eur)*100
[1] 421.7984
Thus, the baseline MAPE is 421.798%. Any model we propose should have a MAPE below this to be considered useful.
5.2 Linear Models Prediction
Using the first linear model (where we used all covariates), we obtain the following MAPE:
MAPE(predict(model.lm1, test), test$wage_eur)*100
[1] 93.49752
Using the second linear model (with only significant features selected), we obtain the following MAPE:
MAPE(predict(model.lm2, test), test$wage_eur)*100
[1] 93.25372
We see that both linear models have a MAPE above 90%, which is poor performance.
5.3 Generalized Linear Models Prediction
Using the first glm (without AIC-based feature selection), we get the following MAPE:
MAPE(predict(model.glm1, test, type = "response"), test$wage_eur)*100
[1] 39.50448
Using the second glm (with AIC-based feature selection), we get the following MAPE:
MAPE(predict(model.glm2, test, type = "response"), test$wage_eur)*100
[1] 39.5349
We see a drastic improvement in performance with the GLMs, which model the Wage with a Gamma distribution and a log link. This distributional choice clearly suits the data better.
5.4 Random Forest Model Prediction
Using the first random forest model (without hyperparameter optimization), we get the following MAPE:
MAPE(predict(model.rf1, test), test$wage_eur)*100
[1] 38.32406
Using the random forest model after hyperparameter optimization, we get the following MAPE:
MAPE(predict(model.rf2, test), test$wage_eur)*100
[1] 35.43236
5.5 XGBoost Model Prediction
We create dummy variables for categorical features in the test dataset, to be able to use the XGBoost model.
Code
dummy_test <- dummy_cols(test, select_columns = c('league_name', 'international_reputation', 'work_rate', 'body_type', 'preferred_foot'),
remove_selected_columns = TRUE)
test_x <- data.matrix(dummy_test[, -which(names(dummy_test) %in% c('value_eur', 'wage_eur', 'wage_log', 'release_clause_eur'))])
test_y <- test$wage_eur
xgb_test = xgb.DMatrix(data = test_x, label = test_y)
Using the XGBoost model for wage prediction, we get the following MAPE:
MAPE(predict(model.xgboost, xgb_test), test$wage_eur)*100
[1] 37.44334
5.6 Conclusion
Out of all the models we evaluated on the test set, the Random Forest model (after hyperparameter tuning) gave the best MAPE. The scatter plot of predicted wage vs. actual wage on unseen data is shown below:
Code
ggplot(test, aes(x=predict(model.rf2, test), y=wage_eur)) +
geom_point() +
geom_abline(intercept=0, slope=1) +
labs(x='Predicted Values', y='Actual Values', title='Predicted vs. Actual Values with Random Forest Model')
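To recap, the out-of-sample MAPE values reported in this section can be collected into one comparison (the numbers below are copied from the results above):
Code
# Out-of-sample MAPE (%) of each model, from the results in Section 5
results <- data.frame(
  model = c("Baseline (mean wage)", "Linear model 1", "Linear model 2",
            "Gamma GLM 1", "Gamma GLM 2", "Random forest",
            "Random forest (tuned mtry)", "XGBoost"),
  mape = c(421.80, 93.50, 93.25, 39.50, 39.53, 38.32, 35.43, 37.44)
)
results[order(results$mape), ]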