Football is undoubtedly the most popular sport in the world. Just by looking at the sheer number of clubs and teams associated with the sport and their respective massive fan following, one can easily expect it to be a heavily financed business. In fact, the European football market alone is estimated to be worth more than 25 billion Euros. No wonder why every football club is spending so much time, effort and money searching for best new talents to sign in with them. They are looking for every possible way to gain an edge over the other and ultimately grab a piece of the ‘25 billion’ pie. However, scouting new players can be challenging and identifying the best among them can be trickier. A hit or a miss in this domain is very crucial for a club as it is directly associated with its triumph or loss at the tournaments and leagues. Hence, a scientific and effective method of identifying the best talents would provide the club better chance at grabbing the longer end of the stick.
In this project, we have used predictive analytics techniques to come up with the solution that could revolutionize the hiring procedure of young talents in Football. The goal of this project is to identify and analyze the various constraints that play vital role in the growth of a young player and eventually come up with a list of top twenty players whose growth is expected to be the greatest. We have based our analysis upon the attributes of the players provided by FIFA which is a console-based video game series produced by EA sports. As EA sports regularly goes through extensive research and collaborations with major leagues to collect player data, we can easily vouch for the authenticity and reliability of our data. Hence by using this data and applying correct analysis techniques, it is possible to predict which player has greater chances of growing and performing better for his club in the future.
We have acquired the data online from sofifa.com which has been a trusted repository for the database of football players for the past two decades. The source of the data available in sofifa.com, has in fact been found to be EA sports thus confirming the reliability of the origin of our data. We went through extensive cleaning process of the data thus acquired. The cleaning process included the following steps
FIFA gives each player an overall rating which is a cumulative description of all his performance attributes. We calculate the growth of the players which is simply the difference in the overalls of a player between two consecutive years. Mathematically for an individual player,
Growth = Succeeding year’s overall – Preceding year’s overall
We want to predict the growth of the players for the year 2019 using the data that we have acquired. Hence for this, we need to predict the overall attribute of the players for the year 2019 thus making Overall_2019 our response variable. The data dictionary describing all the different variables are given as follows:
knitr::kable(read.csv("table.csv"))| ï..Variable.Name | Type | Description |
|---|---|---|
| Name | factor | Name of the player |
| Age_18 | integer | The age of the player as of 2018 |
| Overall_13 | integer | The overall rating of the player in the year 2013 |
| Overall_14 | integer | The overall rating of the player in the year 2014 |
| Overall_15 | integer | The overall rating of the player in the year 2015 |
| Overall_16 | integer | The overall rating of the player in the year 2016 |
| Overall_17 | integer | The overall rating of the player in the year 2017 |
| Overall_18 | integer | The overall rating of the player in the year 2018 |
| Potential_19 | integer | The potential of the player projected by FIFA in for year 2019 |
| Previous_growth | integer | Growth in the previous year |
| Special_13 | integer | Special attribute of the player in the year 2013 |
| Special_14 | integer | Special attribute of the player in the year 2014 |
| Special_15 | integer | Special attribute of the player in the year 2015 |
| Special_16 | integer | Special attribute of the player in the year 2016 |
| Special_17 | integer | Special attribute of the player in the year 2017 |
| Special_18 | integer | Special attribute of the player in the year 2018 |
| Club_18 | factor | The club of the player in 2018 |
| Club_19 | factor | The club of the player in 2019 |
| Nationality | factor | The nationality of the player |
| Overall_19 | integer | The overall rating of the player in the year 2019 and our response variable |
TO find how player growth is distributed within the data set. We plotted the frequency of actual player growth.
load(url("https://github.com/GitSujal/FIFAPlayerDataFrom2013to2019/blob/master/merged_data_from_fifa13_to_fifa19.RData?raw=true"))
#Finding how the player growth is distributed
merged_all_years['Actual_growth'] <- merged_all_years$Overall_19 - merged_all_years$Overall_18
merged_all_years['club_change']<- with(merged_all_years, ifelse(as.character(Club_18) == as.character(Club_19), 0, 1))
merged_all_years['Previous_growth']<- merged_all_years$Overall_18 - merged_all_years$Overall_17Actual_growth_stats <- as.data.frame(table(merged_all_years$Actual_growth))
plot_ly(Actual_growth_stats, x = ~Var1, y = ~Freq, type = 'bar', name = 'Growth Distribution of Players')%>% layout(title = "Growth Distribution of Players", xaxis = list(title="Player Growth"), yaxis = list(title= "Number of Players"))Looking at the histogram for the growth distribution of the players, we can clearly see that large majority of the players did not experience positive growth. The largest number of players remained neutral in terms of growth whereas very few players experienced growth of more than 5 points. From this, we can deduce that gaining growth of more than 5 points is hard and not every player is able enough to achieve it. It also provides us a glimpse at the spectrum of the players we are searching for.
Player_Age_Stats <- as.data.frame(table(merged_all_years$Age_18))
plot_ly(Player_Age_Stats,x=~Var1, y=~Freq, type ='bar',name = 'Age Distribution of Players')%>% layout(title = "Age Distribution of Players", xaxis = list(title="Age"), yaxis = list(title= "Number of Players"))From the above graph we can see that the majority of the players active currently belong to the age group between 24-31 years with most of them being 27 years old. We can also see that the number of active players start to plummet after the age of 32 years.
average_growth <- merged_all_years %>%
group_by(Age_18) %>% summarise(Actual_growth=mean(Actual_growth))
plot_ly(average_growth,x=~Age_18, y=~Actual_growth, type ='bar',name = 'Growth Distribution against Age')%>% layout(title = "Growth Distribution against Age", xaxis = list(title="Age"), yaxis = list(title= "Growth of Players"))As we can see from the above graph, the growth of the players is maximum when they are below the age of 20. This makes sense as the players tend to grow more when they are younger while they are still in their learning phase. The growth gradually starts decreasing as the age starts increasing eventually reaching a saturation point where the growth is nil. It is all downhill after that for the older players as the possibility of growth goes negative with age.
Columns Club_18 and Club_19 representing the club the player played for year 2018 and 2019 has been merged into one column named Club_Change which represents whether the player changed the club or not. This amendment has been made because the player performance might change if a player changes a club or stays at the same club. Club change value is 1 if the player club for 2019 is different than player club from 2018 and 0 if the player stays at the same club for year 18 and 19.
club_change_stats <- as.data.frame(table(merged_all_years$club_change))
par(mfrow=c(1,2))
plot_ly(club_change_stats,labels=c("Same Club","Changed Club"),values=~Freq, type ='pie',name = 'Club Change Distribution')%>% layout(title = "Club change Distribution")club_change_growth_stats <- merged_all_years %>%
group_by(club_change)%>%
summarise(Actual_growth = mean(Actual_growth))
plot_ly(club_change_growth_stats,x=~club_change, y=~Actual_growth, type ='bar',name = 'Club Change against Growth')%>% layout(title = "Average Growth against Club Change", xaxis = list(title="Club Change"), yaxis = list(title= "Average Growth of Player"))The above pie chart indicates that among all the players considered, 61.8% chose to stay in their current club while 38.2% opted for a club change. The difference in average player growth value for club change and no club change indicates that the player growth is related in some way with the player growth and should be considered as a variable to predict the player growth.
The dataset has a large number of data where the player growth is either zero or is very low between -2 to 2. Since we wanted to build a model that predicts higher player growth, training the model with data having higher number of small changes in growth may result in model underpredicting the player growth for higher growth players. To avoid this bias we removed the player data with growth values in range -2 to 2. If a player has growth that is higher than 12, the data could be wrong and such anamolies are eliminated.
nrow(merged_all_years)## [1] 2532
merged_all_years <- merged_all_years[abs(merged_all_years$Actual_growth)<=12,]
nrow(merged_all_years)## [1] 2518
merged_all_years <- merged_all_years[abs(merged_all_years$Actual_growth)>=3,]
nrow(merged_all_years)## [1] 420
We have removed the data for players higher than age 33 as the player growth starts to become negative after certain age and those data can introduce a bias that can lead to the model underpredicting the player growth. So any player data for players of age 33 or higher has been discarded assuming that there is no major positive growth after the age of 33.
merged_all_years <- merged_all_years[ merged_all_years$Age_18 < 33 ,]
nrow(merged_all_years)## [1] 354
Preparing the data for the model and splitting the data into test and train set. We only keep the columns that is important for predicting the player growth.
#keeping only the columns necessary for the model.
keep <- c("Name","Age_18","Overall_13","Overall_14","Overall_15","Overall_16","Overall_17","Overall_18","Previous_growth","Potential_19","Special_13","Special_14","Special_15","Special_16","Special_17","Special_18","club_change","Nationality","Overall_19","Actual_growth")
data_for_model <- merged_all_years[,keep]
nrow(data_for_model)## [1] 354
The data is then splitted into test and train set in 75 to 25 ratio. A seed is set for the reproducability of the results.
## 75% of the sample size
smp_size <- floor(0.75 * nrow(data_for_model))
## set the seed to make your partition reproducible
set.seed(1234567)
train_sample <- sample(seq_len(nrow(data_for_model)), size = smp_size)
train <- data_for_model[train_sample, ]
test <- data_for_model[-train_sample, ]
nrow(train)## [1] 265
nrow(test)## [1] 89
ncol(train)## [1] 20
After splitting the data, we get 265 rows of data as train set and 89 rows of data as train set. Both train and test set have 20 columns. Out of which “Overall_19” represents the player performance for the year 2019 and thus is the response variable. The remaining 18 variables can be used as explanatory variables to build our model.
Commencing the analysis, forward selection method was applied for variable selection. The forward selection method provided us with a formula consisting of eight variables as significant variables. The model thus obtained seems to be a good fit and the variables selected by the model makes sense for logical reasoning. As we can see for predicting the player performance for year 2019, potential for the player for the year 2019 seems to be the most significant variable while the club change factor is the least significant one. The player age for the year 2018 also possesses high significance in the model along with their overall attributes for the year 2015 and 2016. The model fits into our logical assumptions and the statistics for the model are acceptable. There are seemingly no patterns in the residual plot except some linear patterns on lower half of the residuals probably indicating the negative player growth with increasing year after players reach a certain age.
#Forward Selection Method
lm.0 <- lm(Overall_19 ~ 1, data=train)
forward_model.lm <- step(lm.0, scope = ~Age_18+Overall_13+Overall_14+Overall_15+Overall_16+Overall_17+Overall_18+Previous_growth+Special_13+Special_14+Special_15+Special_16+Special_17+Special_18+Potential_19+factor(club_change), direction = "forward", trace = 0)
summary(forward_model.lm)##
## Call:
## lm(formula = Overall_19 ~ Potential_19 + Overall_16 + Overall_15 +
## Age_18 + Special_18 + Special_16 + Special_13 + factor(club_change),
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9711 -0.5065 0.2374 1.0859 3.1765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.8371679 2.3836278 -1.190 0.23504
## Potential_19 0.8040336 0.0260002 30.924 < 2e-16 ***
## Overall_16 0.1159945 0.0412946 2.809 0.00535 **
## Overall_15 0.0654998 0.0355074 1.845 0.06624 .
## Age_18 0.0905065 0.0454923 1.989 0.04771 *
## Special_18 0.0017178 0.0007534 2.280 0.02344 *
## Special_16 -0.0024366 0.0009305 -2.619 0.00935 **
## Special_13 0.0012916 0.0007136 1.810 0.07145 .
## factor(club_change)1 -0.3307040 0.2254616 -1.467 0.14366
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.784 on 256 degrees of freedom
## Multiple R-squared: 0.8896, Adjusted R-squared: 0.8861
## F-statistic: 257.8 on 8 and 256 DF, p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(forward_model.lm)press(forward_model.lm)## [1] 902.0735
Backward selection method also produces the same set of variables and the statistics are same for the model.
# Backward selection method
lm.all <- lm(Overall_19~.+factor(club_change)-club_change-Name-Nationality - Actual_growth, data= train)
backward_model.lm <- step(lm.all, direction = "backward", trace = 0)
summary(backward_model.lm)##
## Call:
## lm(formula = Overall_19 ~ Age_18 + Overall_16 + Potential_19 +
## Special_14 + Special_16 + Special_17 + factor(club_change),
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2914 -0.5176 0.2581 1.1676 3.8168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.5377486 2.3783015 -1.067 0.28695
## Age_18 0.0930299 0.0456121 2.040 0.04241 *
## Overall_16 0.1638384 0.0317149 5.166 4.8e-07 ***
## Potential_19 0.8047693 0.0257857 31.210 < 2e-16 ***
## Special_14 0.0019652 0.0009148 2.148 0.03262 *
## Special_16 -0.0029462 0.0009307 -3.166 0.00173 **
## Special_17 0.0019745 0.0008383 2.355 0.01926 *
## factor(club_change)1 -0.3484661 0.2258664 -1.543 0.12411
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.787 on 257 degrees of freedom
## Multiple R-squared: 0.8888, Adjusted R-squared: 0.8858
## F-statistic: 293.5 on 7 and 257 DF, p-value: < 2.2e-16
summary(backward_model.lm)##
## Call:
## lm(formula = Overall_19 ~ Age_18 + Overall_16 + Potential_19 +
## Special_14 + Special_16 + Special_17 + factor(club_change),
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2914 -0.5176 0.2581 1.1676 3.8168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.5377486 2.3783015 -1.067 0.28695
## Age_18 0.0930299 0.0456121 2.040 0.04241 *
## Overall_16 0.1638384 0.0317149 5.166 4.8e-07 ***
## Potential_19 0.8047693 0.0257857 31.210 < 2e-16 ***
## Special_14 0.0019652 0.0009148 2.148 0.03262 *
## Special_16 -0.0029462 0.0009307 -3.166 0.00173 **
## Special_17 0.0019745 0.0008383 2.355 0.01926 *
## factor(club_change)1 -0.3484661 0.2258664 -1.543 0.12411
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.787 on 257 degrees of freedom
## Multiple R-squared: 0.8888, Adjusted R-squared: 0.8858
## F-statistic: 293.5 on 7 and 257 DF, p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(backward_model.lm)press(backward_model.lm)## [1] 899.8516
Lasso selection model is applied for the same set of variables and from the plot, it is evident that 4-10 variables from the data explain the response variable with acceptable error rate. The best model obtained from the lasso selection was selected and the statistics for the model was tested. The model obtained from lasso selection produced better result than that of the forward and backward selection models.
#For lasso Model
response<-train$Overall_19
explanatory<- as.matrix(train[,-c(1,18,19,20)])
lasso.lm <- glmnet(explanatory,response)
lasso.lm.cv <- cv.glmnet(explanatory,response)
summary(lasso.lm.cv)## Length Class Mode
## lambda 77 -none- numeric
## cvm 77 -none- numeric
## cvsd 77 -none- numeric
## cvup 77 -none- numeric
## cvlo 77 -none- numeric
## nzero 77 -none- numeric
## name 1 -none- character
## glmnet.fit 12 elnet list
## lambda.min 1 -none- numeric
## lambda.1se 1 -none- numeric
par(mfrow=c(1,2))
plot(lasso.lm)
plot(lasso.lm.cv)After building the models, each model was tested with the test dataset. The models were used to predict the player performance for year 2019. The predicted values were then plotted against the actual values to evaluate the model. The plot below shows the predicted values on vertical axis and actual values on horizontal axis of the plot. After predicting the player performance we sorted the output based on predicted growth and selected the top 20 players with the highest actual growth. The grouped bar chart below shows the player performance for the year 2018, predicted performance for the year 2019 and actual player performance for the year 2019. Since, any player performing better than expected is also considered a success, the accuracy of predicting the top 20 players with highest growth is found to be 60%.
forward.prediction <- predict(forward_model.lm,test,interval = "predict")
test['Predicted_From_Forward'] <- round(forward.prediction[,1], digits = 0)
test['Predicted_growth_From_Forward'] <- test$Predicted_From_Forward - test$Overall_18
test <- test[order(-test$Predicted_growth_From_Forward),]
test['prediction_from_forward_true'] <- with(test, ifelse(Predicted_From_Forward <=Overall_18, 1, 0))
precentage_accuracy_forward<- sum(test$prediction_from_forward_true)/nrow(test)
precentage_accuracy_forward*100## [1] 60.67416
top20_from_forward_model <- test %>% slice(1:20)
colors <- c("red","blue")
plot_ly(test,x=~Overall_19,y=~Predicted_From_Forward, type='scatter',color=~factor(prediction_from_forward_true),colors=colors)## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
plot_ly(top20_from_forward_model, x = ~factor(Name), y = ~Overall_18, type = 'bar', name = 'Overall 18') %>% add_trace(top20_from_forward_model, y = ~Predicted_From_Forward, name = 'From Forward Selection Prediction') %>% add_trace(top20_from_forward_model, y = ~Overall_19, name = 'Overall 19') %>% layout(title = "Top 20 players using Forward Selection Method", yaxis = list(title = 'Overall',range=c(65,90)),xaxis=list(title="Player"), barmode = 'group')Same result was obtained by predicting the player performance using backward selection model and selecting the top 20 players with the highest growth for the year 2019. The accuracy of predicting top 20 players is again found to be 60%.
backward.prediction <- predict(backward_model.lm,test,interval = "predict")
test['Predicted_From_Backward'] <- round(backward.prediction[,1], digits = 0)
test['Predicted_growth_From_Backward'] <- test$Predicted_From_Backward - test$Overall_18
test <- test[order(-test$Predicted_growth_From_Backward),]
test['prediction_from_backward_true'] <- with(test, ifelse(Predicted_From_Backward <= Overall_18, 1, 0))
precentage_accuracy_backward<- sum(test$prediction_from_backward_true)/nrow(test)
precentage_accuracy_backward*100## [1] 59.55056
top20_from_forward_model <- test %>% slice(1:20)
colors <- c("red","blue")
plot_ly(test,x=~Overall_19,y=~Predicted_From_Backward, type='scatter',color=~factor(prediction_from_backward_true),colors=colors)## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
top20_from_backward_model <- test %>% slice(1:20)
plot_ly(top20_from_backward_model, x = ~factor(Name), y = ~Overall_18, type = 'bar', name = 'Overall 18') %>% add_trace(top20_from_backward_model, y = ~Predicted_From_Backward, name = 'From Backward Selection Prediction') %>% add_trace(top20_from_backward_model, y = ~Overall_19, name = 'Overall 19') %>% layout(title = "Top 20 players using Backward Selection Method", yaxis = list(title = 'Overall',range=c(65,90)),xaxis=list(title="Player"), barmode = 'group')The player performance for 2019 was then predicted using the best model obtained from lasso selection method. This model also has overall accuracy of 60%. However, when we used the model to predict the top 20 players with the highest potential growth, the accuracy of the model was found to be 80%.
lasso.prediction <- predict.cv.glmnet(lasso.lm.cv, s= lasso.lm.cv$lambda.1se, newx = as.matrix(test[,-c(1,18,19,20,21,22,23,24,25,26)]))
test['Predicted_From_Lasso'] <- round(lasso.prediction,digits = 0)
test['Predicted_growth_From_Lasso'] <- test$Predicted_From_Lasso - test$Overall_18
test <- test[order(-test$Predicted_growth_From_Lasso),]
test['prediction_from_lasso_true'] <- with(test, ifelse(Predicted_From_Lasso <= Overall_18, 1, 0))
precentage_accuracy_lasso<- sum(test$prediction_from_lasso_true)/nrow(test)
precentage_accuracy_lasso*100## [1] 60.67416
top20_from_forward_model <- test %>% slice(1:20)
colors <- c("red","blue")
plot_ly(test,x=~Overall_19,y=~Predicted_From_Lasso, type='scatter',color=~factor(prediction_from_lasso_true),colors=colors)## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
top20_from_Lasso_model <- test %>% slice(1:20)
plot_ly(top20_from_Lasso_model, x = ~factor(Name), y = ~Overall_18, type = 'bar', name = 'Overall 18') %>% add_trace(top20_from_Lasso_model, y = ~Predicted_From_Lasso, name = 'From lasso Selection Prediction') %>% add_trace(top20_from_Lasso_model, y = ~Overall_19, name = 'Overall 19') %>% layout(title = "Top 20 players using lasso Selection Method", yaxis = list(title = 'Overall',range=c(65,90)),xaxis=list(title="Player"), barmode = 'group')The forward and backward models produced the same set of players. However, the lasso model gave 2 different players and the prediction accuracy increased by 20%.
Based on historical player data, the model was able to predict the player performance for the upcoming year. While the model is not perfect to predict all the small changes in the player performance, it can accurately predict the players with highest potential growth. Data of the players with lower growth was eliminated from the analysis as small changes in player performance would bias the model, increasing the chances of it underpredicting the growth of players. This resulted in a model which can predict higher growth with greater accuracy which is our ultimate goal. It however fails to predict or often overpredicts the smaller player growth. While the model may fail to predict player growth for all players in the test set, it can be used to find top 20 players from the test dataset with the highest expected growth. The result was found to be 80% accurate when we used the best model from the lasso selection method for the required prediction.
While the dataset is avaialble on Kaggle.com, crawling codes are taken from Aman Shrivastava’s Github repository. The Rdata for the project is avaiable on github as both Rdata file and CSV file FIFAPlayerDataFrom2013to2019.